PREFACE

These are lecture notes for the courses "Tijdreeksen", "Time Series" and "Financial
Time Series". The material is more than can be treated in a one-semester course. See
the next section for the exam requirements.

Parts marked by an asterisk "*" do not belong to the exam requirements.

Exercises marked by a single asterisk "*" are either hard or to be considered of 
secondary importance. Exercises marked by a double asterisk "**" are questions to which 
I do not know the solution. 

Amsterdam, 1995-2001 (revisions, extensions), 

A.W. van der Vaart 



EXAM 

The take-home exam consists of handing in solutions to the following problems listed in
the text. (Note: the numbering of the problems may change over time. The numbers 
mentioned on this page refer to the numbering in this particular version. Don't use them 
with earlier or later versions of the lecture notes.) 

1.11, 1.14, 1.28, 4.9, 5.2, 5.13, 6.8, 6.13, 6.16, 6.17, 7.12, 7.19, 7.20, 7.24, 7.31, 8.1,
8.9, 8.16, 9.12, 10.18, 12.3, 12.4. 

If you find this too much work, remember that you can also do everything at one 
go at the end of the course. 



LITERATURE 

The following list is a small selection of books on time series analysis. Azencott/Dacunha-Castelle
and Brockwell/Davis are close to the core material treated in these notes. The
first book by Brockwell/Davis is a standard book for graduate courses for statisticians. 
Their second book is prettier, because it lacks the overload of formulas and computations 
of the first, but is of a lower level. 

Chatfield is less mathematical, but perhaps of interest from a data-analysis point 
of view. Hannan and Deistler is tough reading, and on systems, which overlaps with 
time series analysis, but is not focused on statistics. Hamilton is a standard work used 
by econometricians; be aware, it has the existence results for ARMA processes wrong. 
Brillinger's book is old, but contains some material that is not covered in the later works. 
Rosenblatt's book is new, and also original in its choice of subjects. Harvey is a proponent 
of using system theory and the Kalman filter for a statistical time series analysis. His 
book is not very mathematical, and a good background to state space modelling. 

Most books lack a treatment of developments of the last 10-15 years, such as 
GARCH models, stochastic volatility models, or cointegration. Mills and Gourieroux 
fill this gap to some extent. The first contains a lot of material, including examples of
fitting models to economic time series, but little mathematics. The second appears to be
written for a more mathematical audience, but is not completely satisfying. For instance,
its discussion of existence and stationarity of GARCH processes is incomplete, and the
presentation is mathematically imprecise in many places.

An alternative to these books is offered by several review papers on volatility models, such
as Bollerslev et al., Ghysels et al., and Shepard. Besides introductory discussion, also
including empirical evidence, these have extensive lists of references for further reading.

The book by Taniguchi and Kakizawa is unique in its emphasis on asymptotic theory, 
including some results on local asymptotic normality. It is valuable as a resource. 
Introduction



1.1 Basic Definitions 

In this course a stochastic time series is a doubly infinite sequence 

$$\ldots, X_{-2}, X_{-1}, X_0, X_1, X_2, \ldots$$

of random variables or random vectors. (Oddly enough a time series is a mathematical
sequence, not a series.) We refer to the index $t$ of $X_t$ as time and think of $X_t$ as the
state or output of a stochastic system at time $t$, even though this is unimportant for the
mathematical theory that we develop. Unless stated otherwise, the variable $X_t$ is assumed
to be real-valued, but we shall also consider series of random vectors and complex-valued
variables. We write "the time series $X_t$" rather than using the more complete $(X_t: t \in \mathbb{Z})$.
Instead of "time series" we may also use "process" or "stochastic process".

Of course, the set of random variables $X_t$, and other variables that we may introduce,
are defined as measurable maps on some underlying probability space. We only make
this more formal if otherwise there could be confusion, and then denote this probability
space by $(\Omega, \mathcal{U}, P)$.

Time series theory is a mixture of probabilistic and statistical concepts. The probabilistic
part is to study and characterize probability distributions of sets of variables $X_t$
that will typically be dependent. The statistical problem is to characterize the probability
distribution of the time series given observations $X_1, \ldots, X_n$ at times $1, 2, \ldots, n$. The
resulting stochastic model can be used in two ways:

- understanding the stochastic system;

- predicting the "future", i.e. $X_{n+1}, X_{n+2}, \ldots$.

In order to have any chance of success it is necessary to assume some a-priori structure
of the time series. Indeed, if the $X_t$ could be completely arbitrary random variables, then
$(X_1, \ldots, X_n)$ would constitute a single observation from an arbitrary distribution on $\mathbb{R}^n$.
Conclusions about this distribution would be impossible, let alone about the distribution
of the future values $X_{n+1}, X_{n+2}, \ldots$.

A basic type of structure is stationarity. This comes in two forms. 

1.1 Definition. The time series $X_t$ is strictly stationary if the distribution (on $\mathbb{R}^{h+1}$)
of the vector $(X_t, X_{t+1}, \ldots, X_{t+h})$ is independent of $t$, for every $h \in \mathbb{N}$.

1.2 Definition. The time series $X_t$ is stationary (or more precisely second order stationary)
if $EX_t$ and $EX_{t+h}X_t$ exist and are finite and do not depend on $t$, for every
$h \in \mathbb{N}$.

It is clear that a strictly stationary time series with finite second moments is also
stationary. For a stationary time series the auto-covariance and auto-correlation at lag
$h \in \mathbb{Z}$ are defined by

$$\gamma_X(h) = \operatorname{cov}(X_{t+h}, X_t),$$

$$\rho_X(h) = \rho(X_{t+h}, X_t) = \frac{\gamma_X(h)}{\gamma_X(0)}.$$

The auto-covariance and auto-correlation are functions on $\mathbb{Z}$ that together with the mean
$\mu = EX_t$ determine the first and second moments of the stationary time series. Note that
$\gamma_X(0) = \operatorname{var} X_t$ is the variance of $X_t$ and $\rho_X(0) = 1$.

1.3 Example (White noise). A doubly infinite sequence of independent, identically
distributed random variables $X_t$ is a strictly stationary time series. Its auto-covariance
function is, with $\sigma^2 = \operatorname{var} X_t$,

$$\gamma_X(h) = \begin{cases} \sigma^2, & \text{if } h = 0, \\ 0, & \text{if } h \neq 0. \end{cases}$$

Any time series $X_t$ with mean zero and covariance function of this type is called a
white noise series. Thus any mean-zero i.i.d. sequence with finite variance is a white
noise series. The converse is not true: there exist white noise series that are not strictly
stationary.

The name "noise" should be intuitively clear. We shall see why it is called "white" 
when discussing spectral theory of time series in Chapter 6. 

White noise series are important building blocks to construct other series, but from 
the point of view of time series analysis they are not so interesting. More interesting are 
series where the random variables are dependent, so that, to a certain extent, the future 
can be predicted from the past. □ 



1.4 EXERCISE. Construct a white noise sequence that is not strictly stationary. 

Figure 1.1. Realization of a Gaussian white noise series of length 250.

1.5 Example (Deterministic trigonometric series). Let $A$ and $B$ be given, uncorrelated
random variables with mean zero and variance $\sigma^2$, and let $\lambda$ be a given number.
Then

$$X_t = A\cos(t\lambda) + B\sin(t\lambda)$$

defines a stationary time series. Indeed, $EX_t = 0$ and

$$\begin{aligned}
\gamma_X(h) &= \operatorname{cov}(X_{t+h}, X_t) \\
&= \cos((t+h)\lambda)\cos(t\lambda)\operatorname{var} A + \sin((t+h)\lambda)\sin(t\lambda)\operatorname{var} B \\
&= \sigma^2\cos(h\lambda).
\end{aligned}$$

Even though $A$ and $B$ are random variables, this type of time series is called deterministic
in time series theory. Once $A$ and $B$ have been determined (at time $-\infty$ say), the process
behaves as a deterministic trigonometric function. This type of time series is an important
building block to model cyclic events in a system, but it is not the typical example of a 
statistical time series that we study in this course. Predicting the future is too easy in 
this case. □ 

1.6 Example (Moving average). Given a white noise series $Z_t$ with variance $\sigma^2$ and
a number $\theta$ set

$$X_t = Z_t + \theta Z_{t-1}.$$

This is called a moving average of order 1. The series is stationary with $EX_t = 0$ and

$$\gamma_X(h) = \operatorname{cov}(Z_{t+h} + \theta Z_{t+h-1}, Z_t + \theta Z_{t-1}) = \begin{cases} (1+\theta^2)\sigma^2, & \text{if } h = 0, \\ \theta\sigma^2, & \text{if } h = \pm 1, \\ 0, & \text{otherwise.} \end{cases}$$

Thus $X_s$ and $X_t$ are uncorrelated whenever $s$ and $t$ are two or more time instants apart.
We speak of short range dependence and say that the time series has short memory.

If the $Z_t$ are an i.i.d. sequence, then the moving average is strictly stationary.

A natural generalization are higher order moving averages of the form
$X_t = Z_t + \theta_1 Z_{t-1} + \cdots + \theta_q Z_{t-q}$. □
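The computation in Example 1.6 uses only that a white noise series has zero covariance between different time points, so the same bookkeeping works for a moving average of any finite order. The following is a minimal sketch of that calculation; the function name `ma_autocovariance` and the list layout of the coefficients are our own choices for illustration, not notation from these notes.

```python
def ma_autocovariance(thetas, sigma2, h):
    """Auto-covariance at lag h of X_t = Z_t + theta_1 Z_{t-1} + ... + theta_q Z_{t-q}.

    Z_t is white noise with variance sigma2; thetas = [theta_1, ..., theta_q].
    The function name and argument layout are illustrative assumptions.
    """
    coeffs = [1.0] + list(thetas)   # coeffs[j] multiplies Z_{t-j}
    h = abs(h)                      # the auto-covariance is symmetric in h
    # Only pairs of coefficients attached to the same Z contribute,
    # because cov(Z_s, Z_t) = 0 for s != t.
    return sigma2 * sum(coeffs[j] * coeffs[j + h] for j in range(len(coeffs) - h))
```

For a single coefficient this returns the three cases displayed in Example 1.6, and it returns zero for every lag larger than the order of the moving average.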
Figure 1.2. Realization of length 250 of the moving average series $X_t = Z_t - 0.5Z_{t-1}$ for Gaussian white
noise $Z_t$.

1.7 EXERCISE. Prove that the series $X_t$ in Example 1.6 is strictly stationary if $Z_t$ is
a strictly stationary sequence.

1.8 Example (Autoregression). Given a white noise series $Z_t$ with variance $\sigma^2$ consider
the equations

$$X_t = \theta X_{t-1} + Z_t, \qquad t \in \mathbb{Z}.$$

The white noise series is defined on some probability space $(\Omega, \mathcal{U}, P)$ and we consider
the equation as "pointwise in $\omega$". This equation does not define $X_t$, but in general has
many solutions. Indeed, we can define the sequence $Z_t$ and the variable $X_0$ in some
arbitrary way on the given probability space and next define the remaining variables
$X_t$ for $t \in \mathbb{Z} \setminus \{0\}$ by the equation. However, suppose that we are only interested in
stationary solutions. Then there is either no solution or a unique solution, depending on
the value of $\theta$, as we shall now prove.

Suppose first that $|\theta| < 1$. By iteration we find that

$$\begin{aligned}
X_t &= \theta(\theta X_{t-2} + Z_{t-1}) + Z_t = \cdots \\
&= \theta^k X_{t-k} + \theta^{k-1} Z_{t-k+1} + \cdots + \theta Z_{t-1} + Z_t.
\end{aligned}$$

For a stationary sequence $X_t$ we have that $E(\theta^k X_{t-k})^2 = \theta^{2k} EX_0^2 \to 0$ as $k \to \infty$. This
suggests that a solution of the equation is given by the infinite series

$$X_t = Z_t + \theta Z_{t-1} + \theta^2 Z_{t-2} + \cdots = \sum_{j=0}^\infty \theta^j Z_{t-j}.$$

We show below in Lemma 1.27 that this series converges almost surely, so that the
preceding display indeed defines some random variable $X_t$. This is a moving average of
infinite order. We can check directly, by substitution in the equation, that $X_t$ satisfies the
auto-regressive relation. (For every $\omega$ for which the series converges; hence only almost
surely. We shall consider this as good enough.)

If we are allowed to change expectations and infinite sums, then we see that

$$EX_t = \sum_{j=0}^\infty \theta^j EZ_{t-j} = 0,$$

$$\gamma_X(h) = \sum_{i=0}^\infty \sum_{j=0}^\infty \theta^i \theta^j EZ_{t+h-i}Z_{t-j} = \sum_{j=0}^\infty \theta^{h+j}\theta^j \sigma^2 1_{h+j \ge 0} = \frac{\theta^{|h|}}{1-\theta^2}\sigma^2.$$

We prove the validity of these formulas in Lemma 1.27. It follows that $X_t$ is indeed a
stationary time series. In this case $\gamma_X(h) \neq 0$ for every $h$ (if $\theta \neq 0$), so that every pair $X_s$ and $X_t$
are dependent. However, because $\gamma_X(h) \to 0$ at exponential speed as $h \to \infty$, this series
is still considered to be short-range dependent. Note that $\gamma_X(h)$ oscillates if $\theta < 0$ and
decreases monotonically if $\theta > 0$.
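The closed form $\theta^{|h|}\sigma^2/(1-\theta^2)$ can be checked numerically against a truncated version of the infinite sum. The sketch below does this under the assumption $|\theta| < 1$; the function names and the truncation length `terms` are hypothetical choices made for illustration only.

```python
def ar1_autocovariance_series(theta, sigma2, h, terms=200):
    """Truncated sum of theta^(h+j) * theta^j * sigma2 over j >= 0 with h + j >= 0.

    Assumes |theta| < 1; 'terms' is an arbitrary truncation length.
    """
    start = max(0, -h)  # enforce the restriction h + j >= 0
    return sigma2 * sum(theta ** (h + j) * theta ** j
                        for j in range(start, start + terms))


def ar1_autocovariance_closed(theta, sigma2, h):
    """Closed form theta^|h| * sigma2 / (1 - theta^2) for |theta| < 1."""
    return theta ** abs(h) * sigma2 / (1.0 - theta ** 2)
```

With a negative value such as $\theta = -0.5$ the closed form changes sign with every lag, which is the oscillation mentioned in the text.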

For $\theta = 1$ the situation is very different: no stationary solution exists. To see this
note that the equation obtained before by iteration now takes the form, for $k = t$,

$$X_t = X_0 + Z_1 + \cdots + Z_t.$$

This implies that $\operatorname{var}(X_t - X_0) = t\sigma^2 \to \infty$ as $t \to \infty$. However, by the triangle inequality
we have that

$$\operatorname{sd}(X_t - X_0) \le \operatorname{sd} X_t + \operatorname{sd} X_0 = 2\operatorname{sd} X_0,$$

for a stationary sequence $X_t$. Hence no stationary solution exists. The situation for $\theta = 1$
is characterized as explosive: the randomness increases significantly as $t \to \infty$ due to the
introduction of a new $Z_t$ for every $t$.

The cases $\theta = -1$ and $|\theta| > 1$ are left as an exercise.

The auto-regressive time series of order one generalizes naturally to auto-regressive
series of the form $X_t = \phi_1 X_{t-1} + \cdots + \phi_p X_{t-p} + Z_t$. The existence of stationary solutions
$X_t$ to this equation is discussed in Chapter 7. □

1.9 EXERCISE. Consider the cases $\theta = -1$ and $|\theta| > 1$. Show that in the first case there
is no stationary solution and in the second case there is a unique stationary solution.
(For $|\theta| > 1$ mimic the argument for $|\theta| < 1$, but with time reversed: iterate
$X_{t-1} = (1/\theta)X_t - Z_t/\theta$.)
Figure 1.3. Realization of length 250 of the stationary solution to the equation $X_t = 0.5X_{t-1} + 0.2X_{t-2} + Z_t$
for $Z_t$ Gaussian white noise.



1.10 Example (GARCH). A time series $X_t$ is called a GARCH(1, 1) process if, for
given nonnegative constants $\alpha$, $\theta$ and $\phi$, and a given i.i.d. sequence $Z_t$ with mean zero
and unit variance, it satisfies a system of equations of the form

$$\sigma_t^2 = \alpha + \phi\sigma_{t-1}^2 + \theta X_{t-1}^2,$$

$$X_t = \sigma_t Z_t.$$
Figure 1.4. Realization of a random walk $X_t = Z_t + \cdots + Z_0$ of length 250 for $Z_t$ Gaussian white noise.



We shall see below that for $0 \le \theta + \phi < 1$ there exists a unique stationary solution $X_t$ to
these equations and this has the further properties that $\sigma_t^2$ is a measurable function of
$X_{t-1}, X_{t-2}, \ldots$, and that $Z_t$ is independent of these variables. The latter two properties
are usually also included in the requirements for a GARCH series. They imply that

$$EX_t = E\sigma_t EZ_t = 0,$$

$$EX_sX_t = E(X_s\sigma_t)EZ_t = 0, \qquad (s < t).$$

Therefore, a stationary GARCH process with $\theta + \phi \in [0, 1)$ is a white noise process.
However, it is not an i.i.d. process, unless $\theta + \phi = 0$. Because $Z_t$ is independent of
$X_{t-1}, X_{t-2}, \ldots$,

$$E(X_t \mid X_{t-1}, X_{t-2}, \ldots) = \sigma_t EZ_t = 0,$$

$$E(X_t^2 \mid X_{t-1}, X_{t-2}, \ldots) = \sigma_t^2 EZ_t^2 = \sigma_t^2.$$

The first equation shows that $X_t$ is a "martingale difference series". The second exhibits
$\sigma_t^2$ as the conditional variance of $X_t$ given the past. By assumption $\sigma_t^2$ is dependent on
$X_{t-1}$ and hence the time series $X_t$ is not i.i.d.

The abbreviation GARCH is for "generalized auto-regressive conditional heteroscedasticity":
the conditional variances are not i.i.d., and depend on the past through
an auto-regressive scheme. Typically, the conditional variances are not directly observed,
but must be inferred from the observed sequence $X_t$.

Because the conditional mean of $X_t$ given the past is zero, a GARCH process will
fluctuate around the value 0. A large deviation $|X_{t-1}|$ from 0 at time $t-1$ will cause a large
conditional variance $\sigma_t^2 = \alpha + \theta X_{t-1}^2 + \phi\sigma_{t-1}^2$, and then the deviation of $X_t = \sigma_t Z_t$ from
0 will tend to be large as well. Similarly, small deviations from 0 will tend to be followed
by other small deviations. Thus a GARCH process will alternate between periods of big 
fluctuations and periods of small fluctuations. This is also expressed by saying that a 
GARCH process exhibits volatility clustering, a process being "volatile" if it fluctuates 
a lot. Volatility clustering is commonly observed in time series of stock returns. The 
GARCH(1, 1) process has become a popular model for such time series. 
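The two GARCH(1, 1) equations translate directly into a simulation recursion, which may help to see the volatility clustering just described. The following is only a sketch under stated assumptions: standard normal $Z_t$, $\theta + \phi < 1$, a start at the stationary variance $\alpha/(1-\theta-\phi)$ followed by a burn-in period, and hypothetical names (`simulate_garch11`, `burn_in`, `seed`) that do not come from these notes.

```python
import random


def simulate_garch11(n, alpha, theta, phi, seed=0, burn_in=1000):
    """Simulate n values of X_t = sigma_t Z_t with
    sigma_t^2 = alpha + phi * sigma_{t-1}^2 + theta * X_{t-1}^2.

    Assumptions (ours, for illustration): Z_t standard normal, theta + phi < 1,
    start at the stationary variance, discard the first burn_in values.
    Returns the lists (X_t) and (sigma_t^2).
    """
    rng = random.Random(seed)
    sigma2 = alpha / (1.0 - theta - phi)   # stationary second moment as start value
    xs, sigma2s = [], []
    for i in range(n + burn_in):
        z = rng.gauss(0.0, 1.0)
        x = (sigma2 ** 0.5) * z            # X_t = sigma_t Z_t
        if i >= burn_in:
            xs.append(x)
            sigma2s.append(sigma2)
        # conditional variance for the next time point
        sigma2 = alpha + phi * sigma2 + theta * x * x
    return xs, sigma2s
```

Plotting `xs` against time for, say, `alpha=0.1, theta=0.1, phi=0.8` (arbitrary example values) would typically show calm stretches alternating with bursts, while the sample mean of the squares stays near $\alpha/(1-\theta-\phi)$.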

The signs of the $X_t$ are equal to the signs of the $Z_t$ and hence will be independent
over time.

Being a white noise process, a GARCH process can itself be used as input in an- 
other scheme, such as an auto-regressive or a moving average series. There are many 
generalizations of the GARCH process as introduced here. In a GARCH($p$, $q$) process $\sigma_t^2$
is allowed to depend on $\sigma_{t-1}^2, \ldots, \sigma_{t-p}^2$ and $X_{t-1}^2, \ldots, X_{t-q}^2$. A GARCH(0, $q$) process is
also called an ARCH process. The rationale of using the squares $X_t^2$ appears to be mostly
that these are nonnegative and simple; there are many variations using other functions.

As in the case of the auto-regressive relation, the two GARCH equations do not
define the time series $X_t$, but must be complemented with an initial value, for instance
$\sigma_0^2$ if we are only interested in the process for $t \ge 0$. Alternatively, we may "define" this
initial value implicitly by requiring that the series $X_t$ be stationary. We shall now show
that a stationary solution exists, and is unique given the sequence $Z_t$.

By iterating the GARCH relation we find that, for every $n \ge 0$,

$$\begin{aligned}
\sigma_t^2 = \alpha + (\phi + \theta Z_{t-1}^2)\sigma_{t-1}^2 = \alpha &+ \alpha\sum_{j=1}^n (\phi + \theta Z_{t-1}^2) \cdots (\phi + \theta Z_{t-j}^2) \\
&+ (\phi + \theta Z_{t-1}^2) \cdots (\phi + \theta Z_{t-n-1}^2)\sigma_{t-n-1}^2.
\end{aligned}$$

The sequence $\big((\phi + \theta Z_{t-1}^2) \cdots (\phi + \theta Z_{t-n-1}^2)\big)_{n=1}^\infty$, being nonnegative with mean $(\phi + \theta)^{n+1}$,
converges in probability to zero if $\theta + \phi < 1$. If the time series $X_t$ is stationary, then it is
bounded in $L_2$ and hence so is the sequence $(\sigma_{t-n-1})_{n=1}^\infty$. Combined this shows that the
term on the far right converges to zero in probability as $n \to \infty$. Thus for a stationary
solution $X_t$ we must have

$$\sigma_t^2 = \alpha + \alpha\sum_{j=1}^\infty (\phi + \theta Z_{t-1}^2) \cdots (\phi + \theta Z_{t-j}^2). \tag{1.1}$$

Because the series $Z_t$ is assumed i.i.d., the variable $Z_t$ is independent of $\sigma_{t-1}, \sigma_{t-2}, \ldots$
and also of $X_{t-1} = \sigma_{t-1}Z_{t-1}, X_{t-2} = \sigma_{t-2}Z_{t-2}, \ldots$. In addition it follows that the
time series $X_t = \sigma_t Z_t$ is strictly stationary, being a fixed measurable transformation of
$(Z_t, Z_{t-1}, \ldots)$ for every $t$.

The infinite sum in (1.1) converges in mean if $\theta + \phi < 1$ (cf. Lemma 1.25). Given
the series $Z_t$ we can define a process $X_t$ by first defining the conditional variance $\sigma_t^2$ by
(1.1), and next setting $X_t = \sigma_t Z_t$. It can be verified by substitution that
the process $X_t$ solves the GARCH relationship and hence a stationary solution to the
GARCH equations exists if $\phi + \theta < 1$.

By iterating the auto-regressive relation $\sigma_t^2 = \phi\sigma_{t-1}^2 + W_t$, with $W_t = \alpha + \theta X_{t-1}^2$,
in the same way as in Example 1.8, we also find that for the stationary solution
$\sigma_t^2 = \sum_{j=0}^\infty \phi^j W_{t-j}$. Hence $\sigma_t$ is $\sigma(X_{t-1}, X_{t-2}, \ldots)$-measurable.

An inspection of the preceding argument shows that a strictly stationary solution
exists under a weaker condition than $\phi + \theta < 1$. If the sequence $(\phi + \theta Z_{t-1}^2) \cdots (\phi + \theta Z_{t-n}^2)$
converges to zero in probability as $n \to \infty$ and $X_t$ is a solution of the GARCH relation
such that $\sigma_t^2$ is bounded in probability as $t \to -\infty$, then the same argument shows that
$\sigma_t^2$ must relate to the $Z_t$ as given. Furthermore, if the series on the right side of (1.1)
converges in probability, then $X_t$ may be defined as before. It can be shown that this is
the case under the condition that $E\log(\phi + \theta Z_t^2) < 0$. (See Exercise 1.14.) For instance,
for standard normal variables $Z_t$ and $\phi = 0$ this reduces to $\theta < 2e^\gamma \approx 3.56$, where $\gamma \approx 0.577$
is Euler's constant. On the other hand, the condition $\phi + \theta < 1$ is necessary for the GARCH
process to have finite second moments. □
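The numerical bound for standard normal $Z_t$ and $\phi = 0$ can be traced as follows: the condition $E\log(\theta Z_t^2) < 0$ reads $\log\theta < -E\log Z_t^2$, and for a standard normal variable $E\log Z_t^2 = -\gamma - \log 2$ with $\gamma$ Euler's constant (a standard fact about the chi-square distribution with one degree of freedom, not derived in these notes). The sketch below checks this value by a crude midpoint rule; the function names, the integration range and the step count are arbitrary illustrative choices.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler's constant


def expected_log_z_squared(upper=10.0, steps=200_000):
    """Midpoint-rule approximation of E log(Z^2) for a standard normal Z.

    The integrand has an integrable logarithmic singularity at 0; the
    midpoint rule never evaluates it there. Range and step count are
    illustrative assumptions.
    """
    width = upper / steps
    total = 0.0
    for k in range(steps):
        z = (k + 0.5) * width
        density = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        total += math.log(z * z) * density * width
    return 2.0 * total  # the integrand is symmetric in z


def theta_bound():
    """Upper bound 2 * exp(gamma) on theta when phi = 0 and Z_t is standard normal."""
    return 2.0 * math.exp(EULER_GAMMA)
```

Exponentiating the negative of the numerical expectation should reproduce `theta_bound()` up to the integration error, which connects the condition on the logarithm with the bound quoted in the text.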

1.11 EXERCISE. Let $\theta + \phi \in [0, 1)$ and $1 - \kappa\theta^2 - \phi^2 - 2\theta\phi > 0$, where $\kappa = EZ_t^4$. Show
that the second and fourth (marginal) moments of a stationary GARCH process are
given by $\alpha/(1 - \theta - \phi)$ and $\kappa\alpha^2(1 + \theta + \phi)/\big((1 - \kappa\theta^2 - \phi^2 - 2\theta\phi)(1 - \theta - \phi)\big)$. From this
compute the kurtosis of the GARCH process with standard normal $Z_t$. [You can use
(1.1), but it is easier to use the GARCH relations.]

1.12 EXERCISE. Show that $EX_t^4 = \infty$ if $1 - \kappa\theta^2 - \phi^2 - 2\theta\phi = 0$.

1.13 EXERCISE. Suppose that the process $X_t$ is square-integrable and satisfies the
GARCH relation for an i.i.d. sequence $Z_t$ such that $Z_t$ is independent of $X_{t-1}, X_{t-2}, \ldots$
and such that $\sigma_t^2 = E(X_t^2 \mid X_{t-1}, X_{t-2}, \ldots)$, for every $t$, and some $\alpha, \phi, \theta > 0$. Show that
$\phi + \theta < 1$. [Derive that $EX_t^2 = \alpha + \alpha\sum_{j=1}^n (\phi + \theta)^j + (\phi + \theta)^{n+1}EX_{t-n-1}^2$.]

1.14 EXERCISE. Let $Z_t$ be an i.i.d. sequence with $E\log(Z_t^2) < 0$. Show that
$\sum_{j=0}^\infty Z_t^2 Z_{t-1}^2 \cdots Z_{t-j}^2 < \infty$ almost surely. [By the law of large numbers there exists
for almost every realization of $Z_t$ a number $N$ such that $n^{-1}\sum_{j=1}^n \log Z_{t-j}^2 < c < 0$ for
every $n \ge N$. Show that this implies that $\sum_{n \ge N} Z_t^2 Z_{t-1}^2 \cdots Z_{t-n}^2 < \infty$ almost surely.]



1.15 Example (Stochastic volatility). A general approach to obtain a time series with
volatility clustering is to define $X_t = \sigma_t Z_t$ for an i.i.d. sequence $Z_t$ and a process $\sigma_t$ that
depends "positively on its past". A GARCH model fits this scheme, but a simpler way
to achieve the same aim is to let $\sigma_t$ depend only on its own past and independent noise.
Because $\sigma_t$ is to have an interpretation as a scale parameter, we restrict it to be positive.
One way to combine these requirements is to set

$$h_t = \theta h_{t-1} + W_t,$$

$$\sigma_t^2 = e^{h_t},$$

$$X_t = \sigma_t Z_t.$$
Figure 1.5. Realization of the GARCH(1, 1) process with $\alpha = 0.1$, $\phi = 0$, $\theta = 0.8$ of length 500 for $Z_t$
Gaussian white noise.



Here $W_t$ is a white noise sequence, $h_t$ is a (stationary) solution to the auto-regressive
equation, and the process $Z_t$ is i.i.d. and independent of the process $W_t$. If $\theta > 0$ and
$\sigma_{t-1} = e^{h_{t-1}/2}$ is large, then $\sigma_t = e^{h_t/2}$ will tend to be large as well, and hence the
process $X_t$ will exhibit volatility clustering.

The process $h_t$ will typically not be observed and for that reason is sometimes called
latent. A "stochastic volatility process" of this type is an example of a (nonlinear) state
space model, discussed in Chapter 9. Rather than defining $\sigma_t$ by an auto-regression in
the exponent, we may choose a different scheme. For instance, an EGARCH($p$, 0) model
postulates the relationship

$$\log\sigma_t = \alpha + \sum_{j=1}^p \phi_j \log\sigma_{t-j}.$$

This is not a stochastic volatility model, because it does not include a random disturbance.
The symmetric EGARCH($p$, $q$) model repairs this by adding terms depending on
the past of the observed series $X_t = \sigma_t Z_t$, giving

$$\log\sigma_t = \alpha + \sum_{j=1}^q \theta_j |Z_{t-j}| + \sum_{j=1}^p \phi_j \log\sigma_{t-j}.$$

In this sense GARCH processes and their variants are much related to stochastic volatility
models. In view of the recursive nature of the definitions of $\sigma_t$ and $X_t$, they are perhaps
more complicated. □
Figure 1.6. Realization of length 250 of the stochastic volatility model $X_t = e^{h_t/2}Z_t$ for a standard
Gaussian i.i.d. process $Z_t$ and a stationary auto-regressive process $h_t = 0.8h_{t-1} + W_t$ for a standard Gaussian
i.i.d. process $W_t$.



1.2 Filters 

Many time series in real life are not stationary. Rather than modelling a nonstationary
sequence, such a sequence is often transformed into one or more time series that are
(assumed to be) stationary. The statistical analysis next focuses on the transformed
series.

Two important deviations from stationarity are trend and seasonality. A trend is a 
long term, steady increase or decrease in the general level of the time series. A seasonal 
component is a cyclic change in the level, the cycle length being for instance a year or a 
week. Even though Example 1.5 shows that a perfectly cyclic series can be modelled as
a stationary series, it is often considered wise to remove such perfect cycles from a given 
series before applying statistical techniques. 

There are many ways in which a given time series can be transformed into a series
that is easier to analyse. Transforming individual variables $X_t$ into variables $f(X_t)$ by
a fixed function $f$ (such as the logarithm) is a common technique, as is detrending by
subtracting a "best fitting polynomial in $t$" of some fixed degree. This is commonly
found by the method of least squares: given a nonstationary time series $X_t$ we determine
constants $\beta_0, \ldots, \beta_p$ by minimizing

$$(\beta_0, \ldots, \beta_p) \mapsto \sum_{t=1}^n (X_t - \beta_0 - \beta_1 t - \cdots - \beta_p t^p)^2.$$

Next the time series $X_t - \beta_0 - \beta_1 t - \cdots - \beta_p t^p$, for the minimizing coefficients $\beta_0, \ldots, \beta_p$,
is assumed to be stationary.
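For a polynomial of degree $p = 1$ the least squares problem has a well-known closed-form solution, which gives a compact illustration of detrending. The sketch below covers only this linear case, with time points $t = 1, \ldots, n$; the function name `detrend_linear` is a hypothetical choice, and higher degrees would require solving the general normal equations instead.

```python
def detrend_linear(xs):
    """Subtract the least-squares line beta0 + beta1 * t (t = 1, ..., n) from xs.

    Only the degree-one case is treated; needs at least two observations.
    Returns (beta0, beta1, residuals).
    """
    n = len(xs)
    ts = range(1, n + 1)
    t_mean = sum(ts) / n
    x_mean = sum(xs) / n
    # Closed-form minimizers of sum_t (x_t - beta0 - beta1 * t)^2.
    beta1 = (sum((t - t_mean) * (x - x_mean) for t, x in zip(ts, xs))
             / sum((t - t_mean) ** 2 for t in ts))
    beta0 = x_mean - beta1 * t_mean
    residuals = [x - beta0 - beta1 * t for t, x in zip(ts, xs)]
    return beta0, beta1, residuals
```

The residuals play the role of the detrended series; by construction they sum to zero, whatever the input.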

A standard transformation for financial time series is to (log) returns, given by

$$\log\frac{X_t}{X_{t-1}}, \qquad \text{or} \qquad \frac{X_t}{X_{t-1}} - 1.$$

If $X_t/X_{t-1}$ is close to unity for all $t$, then these transformations are similar, as $\log x \approx
x - 1$ for $x \approx 1$. Because $\log(e^{ct}/e^{c(t-1)}) = c$, a log return can be intuitively interpreted
as the exponent of exponential growth. Many financial time series exhibit an exponential
trend.
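Both return transformations are one-line computations on a price series. The following sketch assumes strictly positive prices stored in a list; the function names `log_returns` and `simple_returns` are our own labels for the two expressions displayed above.

```python
import math


def log_returns(prices):
    """Log returns log(X_t / X_{t-1}) for a list of positive prices."""
    return [math.log(prices[t] / prices[t - 1]) for t in range(1, len(prices))]


def simple_returns(prices):
    """Returns X_t / X_{t-1} - 1 for a list of positive prices."""
    return [prices[t] / prices[t - 1] - 1.0 for t in range(1, len(prices))]
```

Applied to an exactly exponential series $X_t = e^{ct}$ the log returns are all equal to $c$, and for small $c$ the simple returns differ from them only by a term of order $c^2$.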

A general method to transform a nonstationary sequence into a stationary one, advocated
with much success in a famous book by Box and Jenkins, is filtering.

1.16 Definition. The (linear) filter with filter coefficients $\psi_j$ for $j \in \mathbb{Z}$ is the operation
that transforms a given time series $X_t$ into the time series $Y_t = \sum_{j \in \mathbb{Z}} \psi_j X_{t-j}$.

A linear filter is nothing but a moving average of infinite order. In Lemma 1.27
we give conditions for the infinite series to be well defined. All filters used in practice
are finite filters: only finitely many coefficients are nonzero. Important examples are the
difference filter $\nabla X_t = X_t - X_{t-1}$, its repetitions $\nabla^k X_t = \nabla\nabla^{k-1}X_t$ defined recursively
for $k = 2, 3, \ldots$, and the seasonal difference filter $\nabla_k X_t = X_t - X_{t-k}$.

1.17 Example (Polynomial trend). A linear trend model could take the form $X_t =
at + Z_t$ for a strictly stationary time series $Z_t$. If $a \neq 0$, then the time series $X_t$ is not
stationary in the mean. However, the differenced series $\nabla X_t = a + Z_t - Z_{t-1}$ is stationary.

Thus differencing can be used to remove a linear trend. Similarly, a polynomial trend
can be removed by repeated differencing: a polynomial trend of degree $k$ is removed by
applying $\nabla^k$. □
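The effect of repeated differencing on a polynomial can be observed directly on a finite stretch of data. The sketch below implements the difference filter and its repetitions on a list; the names `difference` and `difference_k` are hypothetical, and each application shortens the list because the first values have no predecessor.

```python
def difference(xs, lag=1):
    """Difference filter: returns x_t - x_{t-lag} for all t where both exist.

    lag=1 is the ordinary difference filter; a larger lag gives the
    seasonal difference filter.
    """
    return [xs[t] - xs[t - lag] for t in range(lag, len(xs))]


def difference_k(xs, k):
    """Apply the ordinary difference filter k times."""
    for _ in range(k):
        xs = difference(xs)
    return xs
```

For a pure polynomial of degree $k$ with leading coefficient $c$, the output of `difference_k(xs, k)` is the constant $k!\,c$, and one further difference gives zeros; a stationary noise term added to the polynomial would remain, in filtered form.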

1.18 EXERCISE. Check this for a series of the form $X_t = at + bt^2 + Z_t$.

1.19 EXERCISE. Does a (repeated) seasonal filter also remove polynomial trend? 
Figure 1.7. Prices of Hewlett Packard on New York Stock Exchange and corresponding log returns.

1.20 Example (Random walk). A random walk is defined as the sequence of partial
sums $X_t = Z_1 + Z_2 + \cdots + Z_t$ of an i.i.d. sequence $Z_t$. A random walk is not stationary,
but the differenced series $\nabla X_t = Z_t$ certainly is. □

1.21 Example (Monthly cycle). If $X_t$ is the value of a system in month $t$, then $\nabla_{12}X_t$
is the change in the system during the past year. For seasonal variables without trend
this series might be modelled as stationary. For series that contain both yearly seasonality
and trend, the series $\nabla^k\nabla_{12}X_t$ might be stationary. □

1.22 Example (Weekly cycle). If $X_t$ is the value of a system at day $t$, then $Y_t =
(1/7)\sum_{j=0}^6 X_{t-j}$ is the average value over the last week. This series might show trend,
but should not show seasonality due to day-of-the-week. We could study seasonality
by considering the time series $X_t - Y_t$, which results from the filter with coefficients
$(\psi_0, \ldots, \psi_6) = (6/7, -1/7, \ldots, -1/7)$. □
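A finite causal filter is a short loop over the coefficients, and the weekly example can be reproduced with it. The sketch below assumes the coefficients $\psi_0, \ldots, \psi_q$ are given as a list and only returns the outputs for which all needed past values are available; the name `apply_filter` is a hypothetical choice.

```python
def apply_filter(psi, xs):
    """Finite causal filter: Y_t = sum_{j=0}^{q} psi[j] * x_{t-j}.

    psi = [psi_0, ..., psi_q]; the output starts at index t = q of xs,
    the first time point for which all terms exist.
    """
    q = len(psi) - 1
    return [sum(psi[j] * xs[t - j] for j in range(q + 1))
            for t in range(q, len(xs))]
```

With `psi = [1/7] * 7` this gives the weekly average $Y_t$, and with `psi = [6/7] + [-1/7] * 6` it gives $X_t - Y_t$ in a single pass.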

1.23 EXERCISE. Show that the result of two filters with coefficients $\alpha_j$ and $\beta_j$ applied
in turn (if well-defined) is the filter with coefficients $\gamma_j$ given by $\gamma_k = \sum_j \alpha_j\beta_{k-j}$. This
is called the convolution of the two filters. Infer that filtering is commutative.

1.24 Definition. A filter with coefficients $\psi_j$ is causal if $\psi_j = 0$ for every $j < 0$.
Figure 1.8. Realization of the time series $t + 0.05t^2 + X_t$ for the stationary auto-regressive process $X_t$
satisfying $X_t - 0.8X_{t-1} = Z_t$ for Gaussian white noise $Z_t$, and the same series after once and twice differencing.

For a causal filter the variable $Y_t = \sum_j \psi_j X_{t-j}$ depends only on the values
$X_t, X_{t-1}, \ldots$ of the original time series in the present and past, not the future. This is
important for prediction. Given $X_t$ up to some time $t$, we can calculate $Y_t$ up to time $t$.
If $Y_t$ is stationary, we can use results for stationary time series to predict the future value
$Y_{t+1}$. Next we predict the future value $X_{t+1}$ by $X_{t+1} = \psi_0^{-1}\big(Y_{t+1} - \sum_{j>0} \psi_j X_{t+1-j}\big)$.

In order to derive conditions that guarantee that an infinite filter is well-defined, we
start with a lemma concerning series of random variables. Recall that a series $\sum_t x_t$ of
nonnegative numbers is always well-defined (although possibly $\infty$), where the order of
summation is irrelevant. Furthermore, for general numbers $x_t$ the absolute convergence
$\sum_t |x_t| < \infty$ implies that $\sum_t x_t$ exists as a finite number, where the order of summation
is again irrelevant. We shall be concerned with series indexed by $t \in \mathbb{N}$, $t \in \mathbb{Z}$, $t \in \mathbb{Z}^2$, or
$t$ contained in some other countable set $T$. It follows from the preceding that $\sum_{t \in T} x_t$ is
well-defined as a limit as $n \to \infty$ of partial sums $\sum_{t \in T_n} x_t$, for any increasing sequence
of finite subsets $T_n \subset T$ with union $T$, if either every $x_t$ is nonnegative or $\sum_t |x_t| < \infty$.
For instance, in the case that the index set $T$ is equal to $\mathbb{Z}$, we can choose the sets
$T_n = \{t \in \mathbb{Z}: |t| \le n\}$.



1.25 Lemma. Let $(X_t: t \in T)$ be an arbitrary countable set of random variables.
(i) If $X_t \ge 0$ for every $t$, then $E\sum_t X_t = \sum_t EX_t$ (possibly $+\infty$);
(ii) If $\sum_t E|X_t| < \infty$, then the series $\sum_t X_t$ converges absolutely almost surely and
$E\sum_t X_t = \sum_t EX_t$.

Proof. Suppose $T = \cup_j T_j$ for an increasing sequence $T_1 \subset T_2 \subset \cdots$ of finite subsets of
$T$. Assertion (i) follows from the monotone convergence theorem applied to the variables
$Y_j = \sum_{t \in T_j} X_t$. Assertion (ii) follows from the dominated convergence theorem applied
to the same variables $Y_j$. These are dominated by $\sum_t |X_t|$, which is integrable because
its expectation can be computed as $\sum_t E|X_t|$ by (i). ■

The dominated convergence theorem in the proof of (ii) actually gives a better result,
namely: if $\sum_t E|X_t| < \infty$, then

$$E\Big|\sum_{t \in T} X_t - \sum_{t \in T_j} X_t\Big| \to 0, \qquad \text{if } T_1 \subset T_2 \subset \cdots \uparrow T.$$

This is called the convergence in mean of the series $\sum_t X_t$. The analogous convergence
of the second moment is called the convergence in second mean. Alternatively, we speak
of "convergence in quadratic mean" or "convergence in $L_1$ or $L_2$".

1.26 EXERCISE. Suppose that $E|X_n - X|^p \to 0$ and $E|X|^p < \infty$ for some $p \ge 1$. Show
that $EX_n^k \to EX^k$ for every $0 < k \le p$.

1.27 Lemma. Let $(Z_t: t \in \mathbb{Z})$ be an arbitrary time series and let $\sum_j |\psi_j| < \infty$.

(i) If $\sup_t E|Z_t| < \infty$, then $\sum_j \psi_j Z_{t-j}$ converges absolutely, almost surely and in mean.
(ii) If $\sup_t E|Z_t|^2 < \infty$, then $\sum_j \psi_j Z_{t-j}$ converges in second mean as well.
(iii) If the series $Z_t$ is stationary, then so is the series $X_t = \sum_j \psi_j Z_{t-j}$ and
$\gamma_X(h) = \sum_l \sum_j \psi_j\psi_{j+l-h}\gamma_Z(l)$.

Proof. (i). Because $\sum_j E|\psi_j Z_{t-j}| \le \sup_t E|Z_t| \sum_j |\psi_j| < \infty$, it follows by (ii) of the
preceding lemma that the series $\sum_j \psi_j Z_{t-j}$ is absolutely convergent, almost surely. The
convergence in mean follows as in the remark following the lemma.

(ii). By (i) the series is well-defined almost surely, and
$\sum_j \psi_j Z_{t-j} - \sum_{|j| \le k} \psi_j Z_{t-j} = \sum_{|j| > k} \psi_j Z_{t-j}$. By the triangle inequality we have

$$\Big|\sum_{|j|>k} \psi_j Z_{t-j}\Big|^2 \le \Big(\sum_{|j|>k} |\psi_j Z_{t-j}|\Big)^2 = \sum_{|j|>k}\sum_{|i|>k} |\psi_j||\psi_i||Z_{t-j}||Z_{t-i}|.$$

By the Cauchy-Schwarz inequality $E|Z_{t-j}||Z_{t-i}| \le \big(E|Z_{t-j}|^2 E|Z_{t-i}|^2\big)^{1/2}$, which is
bounded by $\sup_t E|Z_t|^2$. Therefore, in view of (i) of the preceding lemma the expectation
of the left side of the preceding display is bounded above by

$$\sum_{|j|>k}\sum_{|i|>k} |\psi_j||\psi_i| \sup_t E|Z_t|^2 = \Big(\sum_{|j|>k} |\psi_j|\Big)^2 \sup_t E|Z_t|^2.$$

This converges to zero as $k \to \infty$.

(iii). By (i) the series $\sum_j \psi_j Z_{t-j}$ converges in mean. Therefore, $\mathrm{E}\sum_j \psi_j Z_{t-j} =
\sum_j \psi_j \mathrm{E}Z_t$, which is independent of $t$. Using arguments as before, we see that we can
also justify the interchange of the order of expectations (hidden in the covariance) and
double sums in

$$\gamma_X(h) = \operatorname{cov}\Bigl(\sum_j \psi_j Z_{t+h-j}, \sum_i \psi_i Z_{t-i}\Bigr)
= \sum_j \sum_i \psi_j \psi_i \operatorname{cov}(Z_{t+h-j}, Z_{t-i}) = \sum_j \sum_i \psi_j \psi_i \gamma_Z(h - j + i).$$

This can be written in the form given by the lemma by the change of variables $(j, i) \mapsto
(j, l - h + j)$. ■
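
When only finitely many coefficients $\psi_j$ are nonzero, the double sum in part (iii) can be evaluated directly. The following Python sketch is an illustration added for concreteness, not part of the original argument; the function names and the example filter $X_t = Z_t + 0.5 Z_{t-1}$ are hypothetical choices.

```python
def filtered_autocov(psi, gamma_z, h):
    """gamma_X(h) = sum_j sum_i psi_j psi_i gamma_Z(h - j + i) for a finite filter.

    psi maps an index j to the coefficient psi_j (absent indices mean 0);
    gamma_z is a function returning gamma_Z at any integer lag.
    """
    return sum(pj * pi * gamma_z(h - j + i)
               for j, pj in psi.items()
               for i, pi in psi.items())


def white_noise_autocov(sigma2):
    """Auto-covariance of white noise: sigma2 at lag 0 and zero at other lags."""
    return lambda lag: sigma2 if lag == 0 else 0.0


# Hypothetical example: X_t = Z_t + 0.5 Z_{t-1} with unit-variance white noise.
psi = {0: 1.0, 1: 0.5}
gamma_z = white_noise_autocov(1.0)
gammas = [filtered_autocov(psi, gamma_z, h) for h in range(-2, 3)]
# gammas lists gamma_X(-2), ..., gamma_X(2); it is symmetric in h and
# vanishes for |h| > 1, as expected for a filter of length two.
```

For an infinite filter the same formula applies after truncating the sums, with the truncation error controlled as in part (ii) of the proof.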

1.28 EXERCISE. Suppose that the series $Z_t$ in (iii) is strictly stationary. Show that the
series $X_t$ is strictly stationary whenever it is well-defined.

1.29 EXERCISE. For a white noise series $Z_t$, part (ii) of the preceding lemma can be
improved: Suppose that $Z_t$ is a white noise sequence and $\sum_j \psi_j^2 < \infty$. Show that $\sum_j \psi_j Z_{t-j}$
converges in second mean. (For this exercise you need some of the material of Chapter 2.)



1.3 Complex Random Variables 

Even though no real-life time series is complex-valued, the use of complex numbers is 
notationally convenient to develop the mathematical theory. In this section we discuss 
complex-valued random variables. 

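A complex random variable $Z$ is a map from some probability space into the field
of complex numbers whose real and imaginary parts are random variables. For complex
random variables $Z = X + iY$, $Z_1$ and $Z_2$, we define

$$\mathrm{E}Z = \mathrm{E}X + i\mathrm{E}Y,$$
$$\operatorname{var} Z = \mathrm{E}|Z - \mathrm{E}Z|^2,$$
$$\operatorname{cov}(Z_1, Z_2) = \mathrm{E}(Z_1 - \mathrm{E}Z_1)\overline{(Z_2 - \mathrm{E}Z_2)}.$$

Some simple properties are, for $\alpha, \beta \in \mathbb{C}$,

$$\mathrm{E}\alpha Z = \alpha \mathrm{E}Z, \qquad \mathrm{E}\overline{Z} = \overline{\mathrm{E}Z},$$
$$\operatorname{var} Z = \mathrm{E}|Z|^2 - |\mathrm{E}Z|^2 = \operatorname{var} X + \operatorname{var} Y = \operatorname{cov}(Z, Z),$$
$$\operatorname{var}(\alpha Z) = |\alpha|^2 \operatorname{var} Z,$$
$$\operatorname{cov}(\alpha Z_1, \beta Z_2) = \alpha \overline{\beta} \operatorname{cov}(Z_1, Z_2),$$
$$\operatorname{cov}(Z_1, Z_2) = \overline{\operatorname{cov}(Z_2, Z_1)} = \mathrm{E}Z_1 \overline{Z_2} - \mathrm{E}Z_1 \overline{\mathrm{E}Z_2}.$$

1.30 EXERCISE. Prove the preceding identities.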

The definitions given for real time series apply equally well to complex time series.
Lemma 1.27 also extends to complex time series $Z_t$, where in (iii) we must read
"$\gamma_X(h) = \sum_l \sum_j \psi_j \overline{\psi}_{j+l-h} \gamma_Z(l)$".

1.31 EXERCISE. Show that the auto-covariance function of a complex stationary time
series $Z_t$ is conjugate symmetric: $\gamma_Z(-h) = \overline{\gamma_Z(h)}$ for every $h \in \mathbb{Z}$.
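
The identities for complex variances and covariances listed before Exercise 1.30 can be checked numerically on a finite probability space. The sketch below is only an illustration and assumes the uniform distribution on four outcomes; the helper names `mean`, `cvar` and `ccov` and the sample values are hypothetical. It does not replace the proofs asked for in the exercises.

```python
def mean(values):
    """Expectation under the uniform distribution on a finite list of outcomes."""
    return sum(values) / len(values)


def cvar(z):
    """var Z = E|Z - EZ|^2, a nonnegative real number."""
    m = mean(z)
    return mean([abs(v - m) ** 2 for v in z])


def ccov(z1, z2):
    """cov(Z1, Z2) = E (Z1 - EZ1) * conj(Z2 - EZ2)."""
    m1, m2 = mean(z1), mean(z2)
    return mean([(a - m1) * (b - m2).conjugate() for a, b in zip(z1, z2)])


# Hypothetical outcomes of two complex random variables on a four-point space.
z1 = [1 + 2j, 3 - 1j, -2 + 0.5j, 0.5 + 0.5j]
z2 = [0 + 1j, 2 + 2j, -1 - 1j, 1 + 0j]
alpha, beta = 2 - 1j, 0.5 + 3j

# cov(alpha Z1, beta Z2) should equal alpha * conj(beta) * cov(Z1, Z2).
lhs = ccov([alpha * v for v in z1], [beta * v for v in z2])
rhs = alpha * beta.conjugate() * ccov(z1, z2)
```

Note that the conjugate sits on the second argument, which is what makes $\operatorname{cov}(Z, Z)$ real and equal to $\operatorname{var} Z$.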



1.4 Multivariate Time Series 

In many applications the interest is in the time evolution of several variables jointly. This
can be modelled through vector-valued time series. The definition of a stationary time
series applies without changes to vector-valued series $X_t = (X_{t,1}, \ldots, X_{t,d})$, provided
that the mean $\mathrm{E}X_t$ is understood to be the vector $(\mathrm{E}X_{t,1}, \ldots, \mathrm{E}X_{t,d})$ of means of the
coordinates and the auto-covariance function is defined to be the matrix

$$\gamma_X(h) = \bigl(\operatorname{cov}(X_{t+h,i}, X_{t,j})\bigr)_{i,j=1,\ldots,d} = \mathrm{E}(X_{t+h} - \mathrm{E}X_{t+h})(X_t - \mathrm{E}X_t)^T.$$

The auto-correlation at lag $h$ is defined as

$$\rho_X(h) = \bigl(\rho(X_{t+h,i}, X_{t,j})\bigr)_{i,j=1,\ldots,d} = \Bigl(\frac{\gamma_X(h)_{i,j}}{\sqrt{\gamma_X(0)_{i,i}\,\gamma_X(0)_{j,j}}}\Bigr)_{i,j=1,\ldots,d}.$$

The study of properties of multivariate time series can often be reduced to the study of
univariate time series by taking linear combinations $a^T X_t$ of the coordinates. The first
and second moments satisfy

$$\mathrm{E}a^T X_t = a^T \mathrm{E}X_t, \qquad \gamma_{a^T X}(h) = a^T \gamma_X(h) a.$$

1.32 EXERCISE. What is the relationship between $\gamma_X(h)$ and $\gamma_X(-h)$?



2 Hilbert Spaces and Prediction



In this chapter we first recall definitions and basic facts concerning Hilbert spaces. Next
we apply these to solve the prediction problem: finding the "best" predictor of $X_{n+1}$
based on observations $X_1, \ldots, X_n$.



2.1 Hilbert Spaces and Projections 

Given a measure space $(\Omega, \mathcal{U}, \mu)$ define $\mathcal{L}_2(\Omega, \mathcal{U}, \mu)$ as the set of all measurable functions
$f: \Omega \to \mathbb{C}$ such that $\int |f|^2\, d\mu < \infty$. (Alternatively, all measurable functions with values
in $\mathbb{R}$ with this property.) Here a complex-valued function is said to be measurable if both
its real and imaginary parts are measurable functions, and its integral is by definition
$\int f\, d\mu = \int \operatorname{Re} f\, d\mu + i \int \operatorname{Im} f\, d\mu$, provided the two integrals on the right are defined and
finite. Define

$$\langle f_1, f_2 \rangle = \int f_1 \overline{f_2}\, d\mu,$$
$$\|f\| = \sqrt{\int |f|^2\, d\mu},$$
$$d(f_1, f_2) = \|f_1 - f_2\| = \sqrt{\int |f_1 - f_2|^2\, d\mu}.$$



These define a semi-inner product, a semi-norm, and a semi-metric, respectively. The
first is a semi-inner product in view of the properties:

$$\langle f_1 + f_2, f_3 \rangle = \langle f_1, f_3 \rangle + \langle f_2, f_3 \rangle,$$
$$\langle \alpha f_1, \beta f_2 \rangle = \alpha \overline{\beta} \langle f_1, f_2 \rangle,$$
$$\langle f_2, f_1 \rangle = \overline{\langle f_1, f_2 \rangle},$$
$$\langle f, f \rangle \ge 0, \text{ with equality iff } f = 0 \text{ a.e.}$$

The second is a semi-norm because it has the properties:

$$\|f_1 + f_2\| \le \|f_1\| + \|f_2\|,$$
$$\|\alpha f\| = |\alpha| \|f\|,$$
$$\|f\| = 0 \text{ iff } f = 0 \text{ a.e.}$$

Here the first line, the triangle inequality, is not immediate, but we prove it below. The
other properties are more obvious. The third is a semi-distance, in view of the relations:

$$d(f_1, f_3) \le d(f_1, f_2) + d(f_2, f_3),$$
$$d(f_1, f_2) = d(f_2, f_1),$$
$$d(f_1, f_2) = 0 \text{ iff } f_1 = f_2 \text{ a.e.}$$

Immediate consequences of the definitions and the properties of the inner product
are

$$\|f + g\|^2 = \langle f + g, f + g \rangle = \|f\|^2 + \langle f, g \rangle + \langle g, f \rangle + \|g\|^2,$$
$$\|f + g\|^2 = \|f\|^2 + \|g\|^2, \quad \text{if } \langle f, g \rangle = 0.$$

The last equality is known as the Pythagorean rule. In the complex case this is true,
more generally, if $\operatorname{Re}\langle f, g \rangle = 0$.

2.1 Lemma (Cauchy-Schwarz). Any pair $f, g$ in $\mathcal{L}_2(\Omega, \mathcal{U}, \mu)$ satisfies $|\langle f, g \rangle| \le \|f\| \|g\|$.

Proof. This follows upon working out the inequality $\|f - \lambda g\|^2 \ge 0$ for $\lambda = \langle f, g \rangle / \|g\|^2$. ■

Now the triangle inequality for the norm follows from the preceding decomposition
of $\|f + g\|^2$ and the Cauchy-Schwarz inequality, which, when combined, yield

$$\|f + g\|^2 \le \|f\|^2 + 2\|f\|\|g\| + \|g\|^2 = (\|f\| + \|g\|)^2.$$



Another consequence of the Cauchy-Schwarz inequality is the continuity of the inner
product:

$$f_n \to f,\ g_n \to g \quad \text{implies that} \quad \langle f_n, g_n \rangle \to \langle f, g \rangle.$$

2.2 EXERCISE. Prove this.

2.3 EXERCISE. Prove that $\bigl|\|f\| - \|g\|\bigr| \le \|f - g\|$.

2.4 EXERCISE. Derive the parallelogram rule: $\|f + g\|^2 + \|f - g\|^2 = 2\|f\|^2 + 2\|g\|^2$.

2.5 EXERCISE. Prove that $\|f + ig\|^2 = \|f\|^2 + \|g\|^2$ for every pair $f, g$ of real functions
in $\mathcal{L}_2(\Omega, \mathcal{U}, \mu)$.

2.6 EXERCISE. Let $\Omega = \{1, 2, \ldots, k\}$, $\mathcal{U} = 2^\Omega$ the power set of $\Omega$ and $\mu$ the counting
measure on $\Omega$. Show that $\mathcal{L}_2(\Omega, \mathcal{U}, \mu)$ is exactly $\mathbb{C}^k$ (or $\mathbb{R}^k$ in the real case).
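
In the finite setting of Exercise 2.6 the integral against counting measure is a finite sum, so the inner product, the norm and the inequalities above can be tried out on concrete vectors. The following Python sketch is a numerical illustration only, with hypothetical helper names `inner` and `norm` and arbitrarily chosen vectors in $\mathbb{C}^3$.

```python
def inner(f, g):
    """<f, g> = sum_i f_i * conj(g_i): the integral of f * conj(g) under counting measure."""
    return sum(a * complex(b).conjugate() for a, b in zip(f, g))


def norm(f):
    """||f|| = sqrt(<f, f>); the inner product <f, f> is real and nonnegative."""
    return inner(f, f).real ** 0.5


# Hypothetical vectors in C^3.
f = [1 + 1j, 2 - 1j, 0.5j]
g = [3 + 0j, -1 + 2j, 1 - 1j]
f_plus_g = [a + b for a, b in zip(f, g)]

# Both gaps are nonnegative by Cauchy-Schwarz and the triangle inequality.
cauchy_schwarz_gap = norm(f) * norm(g) - abs(inner(f, g))
triangle_gap = norm(f) + norm(g) - norm(f_plus_g)
```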

We attached the qualifier "semi" to the inner product, norm and distance defined
previously, because in each of the three cases, the last property involves a null set. For
instance $\|f\| = 0$ does not imply that $f = 0$, but only that $f = 0$ almost everywhere. If
we think of two functions that are equal almost everywhere as the same "function", then
we obtain a true inner product, norm and distance. We define $L_2(\Omega, \mathcal{U}, \mu)$ as the set of
all equivalence classes in $\mathcal{L}_2(\Omega, \mathcal{U}, \mu)$ under the equivalence relation "$f \equiv g$ if and only
if $f = g$ almost everywhere". It is a common abuse of terminology, which we adopt as
well, to refer to the equivalence classes as "functions".

2.7 Proposition. The metric space $L_2(\Omega, \mathcal{U}, \mu)$ is complete under the metric $d$.

We shall need this proposition only occasionally, and do not provide a proof. The
proposition asserts that for every sequence $f_n$ of functions in $\mathcal{L}_2(\Omega, \mathcal{U}, \mu)$ such that
$\int |f_n - f_m|^2\, d\mu \to 0$ as $m, n \to \infty$ (a Cauchy sequence), there exists a function $f \in \mathcal{L}_2(\Omega, \mathcal{U}, \mu)$
such that $\int |f_n - f|^2\, d\mu \to 0$ as $n \to \infty$.

A Hilbert space is a general inner product space that is metrically complete. The
space $L_2(\Omega, \mathcal{U}, \mu)$ is an example, and the only example we need. (In fact, this is not a
great loss of generality, because it can be proved that any Hilbert space is (isometrically)
isomorphic to a space $L_2(\Omega, \mathcal{U}, \mu)$ for some $(\Omega, \mathcal{U}, \mu)$.)

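2.8 Definition. Two elements $f, g$ of $L_2(\Omega, \mathcal{U}, \mu)$ are orthogonal if $\langle f, g \rangle = 0$. This is
denoted $f \perp g$. Two subsets $\mathcal{F}, \mathcal{G}$ of $L_2(\Omega, \mathcal{U}, \mu)$ are orthogonal if $f \perp g$ for every $f \in \mathcal{F}$
and $g \in \mathcal{G}$. This is denoted $\mathcal{F} \perp \mathcal{G}$.

2.9 EXERCISE. If $f \perp \mathcal{G}$ for some subset $\mathcal{G} \subset L_2(\Omega, \mathcal{U}, \mu)$, show that $f \perp \overline{\operatorname{lin}}\, \mathcal{G}$, where
$\overline{\operatorname{lin}}\, \mathcal{G}$ is the closure of the linear span of $\mathcal{G}$.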

2.10 Theorem (Projection theorem). Let $L \subset L_2(\Omega, \mathcal{U}, \mu)$ be a closed linear subspace.
For every $f \in L_2(\Omega, \mathcal{U}, \mu)$ there exists a unique element $\Pi f \in L$ that minimizes $\|f - l\|^2$
over $l \in L$. This element is uniquely determined by the requirements $\Pi f \in L$ and
$f - \Pi f \perp L$.

Proof. Let $d = \inf_{l \in L} \|f - l\|$ be the "minimal" distance of $f$ to $L$. This is finite, because
$0 \in L$. Let $l_n$ be a sequence in $L$ such that $\|f - l_n\|^2 \to d^2$. By the parallelogram law

$$\|(l_m - f) + (f - l_n)\|^2 = 2\|l_m - f\|^2 + 2\|f - l_n\|^2 - \|(l_m - f) - (f - l_n)\|^2$$
$$= 2\|l_m - f\|^2 + 2\|f - l_n\|^2 - 4\bigl\|\tfrac12 (l_m + l_n) - f\bigr\|^2.$$

Because $(l_m + l_n)/2 \in L$, the last term on the right is bounded above by $-4d^2$. The two
first terms on the far right both converge to $2d^2$ as $m, n \to \infty$. We conclude that the
left side, which is $\|l_m - l_n\|^2$, is bounded above by $4d^2 + o(1) - 4d^2$ and hence, being
nonnegative, converges to zero. Thus the sequence $l_n$ is Cauchy and has a limit $l$ by the
completeness of $L_2(\Omega, \mathcal{U}, \mu)$. The limit is in $L$, because $L$ is closed. By the continuity of
the norm $\|f - l\| = \lim \|f - l_n\| = d$. Thus the limit $l$ qualifies as $\Pi f$.

If both $\Pi_1 f$ and $\Pi_2 f$ are candidates for $\Pi f$, then we can take the sequence
$l_1, l_2, l_3, \ldots$ in the preceding argument equal to the sequence $\Pi_1 f, \Pi_2 f, \Pi_1 f, \ldots$. It then
follows that this sequence is a Cauchy sequence. This implies that $\Pi_1 f = \Pi_2 f$.

Finally, we consider the orthogonality relation. For every real number $a$ and $l \in L$,
we have

$$\|f - (\Pi f + a l)\|^2 = \|f - \Pi f\|^2 - 2a \operatorname{Re}\langle f - \Pi f, l \rangle + a^2 \|l\|^2.$$

By definition of $\Pi f$ this is minimized at $a = 0$, whence the given parabola (in $a$) must
have its bottom at zero, which is the case if and only if $\operatorname{Re}\langle f - \Pi f, l \rangle = 0$. A similar
argument with $ia$ instead of $a$ shows that $\operatorname{Im}\langle f - \Pi f, l \rangle = 0$ as well. Thus $f - \Pi f \perp L$.
Conversely, if $\langle f - \Pi f, l \rangle = 0$ for every $l \in L$ and $\Pi f \in L$, then $\Pi f - l \in L$ for every
$l \in L$ and by Pythagoras' rule

$$\|f - l\|^2 = \|(f - \Pi f) + (\Pi f - l)\|^2 = \|f - \Pi f\|^2 + \|\Pi f - l\|^2 \ge \|f - \Pi f\|^2.$$

This proves that $\Pi f$ minimizes $l \mapsto \|f - l\|^2$. ■

The function $\Pi f$ given in the preceding theorem is called the (orthogonal) projection
of $f$ onto $L$. From the orthogonality characterization of $\Pi f$, we can see that the map
$f \mapsto \Pi f$ is linear and decreases norm:

$$\Pi(f + g) = \Pi f + \Pi g,$$
$$\Pi(\alpha f) = \alpha \Pi f,$$
$$\|\Pi f\| \le \|f\|.$$

A further important property relates to repeated projections. If $\Pi_L f$ denotes the
projection of $f$ onto $L$, then

$$\Pi_{L_1} \Pi_{L_2} f = \Pi_{L_1} f, \quad \text{if } L_1 \subset L_2.$$

Thus, we can find a projection in steps, by projecting a projection onto a bigger space
a second time on the smaller space. This, again, is best proved using the orthogonality
relations.

2.11 EXERCISE. Prove the relations in the two preceding displays. 
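
In a finite-dimensional real space the projection onto the span of finitely many vectors can be computed by Gram-Schmidt orthonormalization, which makes the orthogonality characterization and the rule for repeated projections easy to observe numerically. The sketch below is an illustration under that assumption (real vectors in $\mathbb{R}^3$ with the usual inner product); the names `dot`, `orthonormal_basis` and `project` are hypothetical, and the check is no substitute for the proofs requested in the exercise.

```python
def dot(u, v):
    """Real inner product <u, v> = sum_i u_i v_i."""
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: an orthonormal basis of the linear span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for e in basis:
            c = dot(w, e)
            w = [wi - c * ei for wi, ei in zip(w, e)]
        length = dot(w, w) ** 0.5
        if length > tol:  # skip vectors already in the span of the earlier ones
            basis.append([wi / length for wi in w])
    return basis


def project(f, vectors):
    """Orthogonal projection of f onto the linear span of `vectors`."""
    proj = [0.0] * len(f)
    for e in orthonormal_basis(vectors):
        c = dot(f, e)
        proj = [p + c * ei for p, ei in zip(proj, e)]
    return proj


# Hypothetical example: L1 = lin{(1,0,0)} is contained in L2 = lin{(1,0,0), (1,1,0)}.
f = [1.0, 2.0, 3.0]
big = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
small = [[1.0, 0.0, 0.0]]

p_big = project(f, big)
residual = [a - b for a, b in zip(f, p_big)]      # orthogonal to every element of L2
two_steps = project(p_big, small)                 # projection onto L2, then onto L1
one_step = project(f, small)                      # projection onto L1 directly
```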

The projection $\Pi_{L_1 + L_2}$ onto the sum $L_1 + L_2 = \{l_1 + l_2: l_i \in L_i\}$ of two closed linear
spaces is not necessarily the sum $\Pi_{L_1} + \Pi_{L_2}$ of the projections. (It is also not true that
the sum of two closed linear subspaces is necessarily closed.) However, this is true if the
spaces $L_1$ and $L_2$ are orthogonal:

$$\Pi_{L_1 + L_2} f = \Pi_{L_1} f + \Pi_{L_2} f, \quad \text{if } L_1 \perp L_2.$$

2.12 EXERCISE.

(i) Show by counterexample that the condition $L_1 \perp L_2$ cannot be omitted.
(ii) Show that $L_1 \perp L_2$ is sufficient.
(iii) Show that $L_1 + L_2$ is closed if $L_1 \perp L_2$ and both $L_1$ and $L_2$ are closed subspaces.

2.13 EXERCISE. Find the projection $\Pi_L f$ for $L$ the one-dimensional space $\{\lambda l: \lambda \in \mathbb{C}\}$.

* 2.14 EXERCISE. Suppose that the set $L$ has the form $L = L_1 + iL_2$ for two closed,
linear spaces $L_1, L_2$ of real functions. Show that the minimizer of $l \mapsto \|f - l\|$ over $l \in L$
for a real function $f$ is the same as the minimizer of $l \mapsto \|f - l\|$ over $L_1$. Does this imply
that $f - \Pi f \perp L_2$? Why is the preceding projection theorem of no use?



2.2 Square-integrable Random Variables 

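For $(\Omega, \mathcal{U}, \mathrm{P})$ a probability space the space $L_2(\Omega, \mathcal{U}, \mathrm{P})$ is exactly the set of all complex
(or real) random variables $X$ with finite second moment $\mathrm{E}|X|^2$. The inner product is the
product expectation $\langle X, Y \rangle = \mathrm{E}X\overline{Y}$, and the inner product between centered variables
is the covariance:

$$\langle X - \mathrm{E}X, Y - \mathrm{E}Y \rangle = \operatorname{cov}(X, Y).$$

The Cauchy-Schwarz inequality takes the form

$$|\mathrm{E}X\overline{Y}|^2 \le \mathrm{E}|X|^2\, \mathrm{E}|Y|^2.$$

When combined the preceding displays imply that $|\operatorname{cov}(X, Y)|^2 \le \operatorname{var} X \operatorname{var} Y$.
Convergence $X_n \to X$ relative to the norm means that $\mathrm{E}|X_n - X|^2 \to 0$ and is referred to
as convergence in second mean. This implies the convergence in mean $\mathrm{E}|X_n - X| \to 0$,
because $\mathrm{E}|X| \le \sqrt{\mathrm{E}|X|^2}$ by the Cauchy-Schwarz inequality. The continuity of the inner
product gives that:

$$\mathrm{E}|X_n - X|^2 \to 0,\ \mathrm{E}|Y_n - Y|^2 \to 0 \quad \text{implies} \quad \operatorname{cov}(X_n, Y_n) \to \operatorname{cov}(X, Y).$$

2.15 EXERCISE. How can you apply this rule to prove equalities of the type
$\operatorname{cov}\bigl(\sum_j \alpha_j X_{t-j}, \sum_j \beta_j Y_{t-j}\bigr) = \sum_i \sum_j \alpha_i \overline{\beta}_j \operatorname{cov}(X_{t-i}, Y_{t-j})$, such as in Lemma 1.27?

2.16 EXERCISE. Show that $\operatorname{sd}(X + Y) \le \operatorname{sd}(X) + \operatorname{sd}(Y)$ for any pair of random variables
$X$ and $Y$.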



2.2.1 Conditional Expectation 

Let $\mathcal{U}_0 \subset \mathcal{U}$ be a sub $\sigma$-field of $\mathcal{U}$. The collection $L$ of all $\mathcal{U}_0$-measurable variables
$Y \in L_2(\Omega, \mathcal{U}, \mathrm{P})$ is a closed, linear subspace of $L_2(\Omega, \mathcal{U}, \mathrm{P})$ (which can be identified with
$L_2(\Omega, \mathcal{U}_0, \mathrm{P})$). By the projection theorem every square-integrable random variable $X$
possesses a projection onto $L$. This particular projection is important enough to derive
a number of special properties.

2.17 Definition. The projection of $X \in L_2(\Omega, \mathcal{U}, \mathrm{P})$ onto the set of all
$\mathcal{U}_0$-measurable square-integrable random variables is called the conditional expectation of
$X$ given $\mathcal{U}_0$. It is denoted by $\mathrm{E}(X \mid \mathcal{U}_0)$.

The name "conditional expectation" suggests that there exists another, more intu- 
itive interpretation of this projection. An alternative definition of a conditional expecta- 
tion is as follows. 

2.18 Definition. The conditional expectation given $\mathcal{U}_0$ of a random variable $X$ which
is either nonnegative or integrable is defined as a $\mathcal{U}_0$-measurable variable $X'$ such that
$\mathrm{E}X1_A = \mathrm{E}X'1_A$ for every $A \in \mathcal{U}_0$.

It is clear from the definition that any other $\mathcal{U}_0$-measurable map $X''$ such that
$X'' = X'$ almost surely is also a conditional expectation. Apart from this indeterminacy
on null sets, a conditional expectation as in the second definition can be shown to be
unique. Its existence can be proved using the Radon-Nikodym theorem. We shall not
give proofs of these facts here.

Because a variable $X \in L_2(\Omega, \mathcal{U}, \mathrm{P})$ is automatically integrable, the second
definition defines a conditional expectation for a larger class of variables. If $\mathrm{E}|X|^2 < \infty$,
so that both definitions apply, then they agree. To see this it suffices to show that a
projection $\mathrm{E}(X \mid \mathcal{U}_0)$ as in the first definition is the conditional expectation $X'$ of the
second definition. Now $\mathrm{E}(X \mid \mathcal{U}_0)$ is $\mathcal{U}_0$-measurable by definition and satisfies the
equality $\mathrm{E}\bigl(X - \mathrm{E}(X \mid \mathcal{U}_0)\bigr)1_A = 0$ for every $A \in \mathcal{U}_0$, by the orthogonality relationship of a
projection. Thus $X' = \mathrm{E}(X \mid \mathcal{U}_0)$ satisfies the requirements of Definition 2.18.
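
On a finite probability space, with $\mathcal{U}_0$ generated by a variable $Y$ taking finitely many values, the conditional expectation is the probability-weighted average of $X$ over each level set of $Y$, and the defining identity $\mathrm{E}X1_A = \mathrm{E}X'1_A$ can be verified directly. The following Python sketch is an illustration under these simplifying assumptions; the function name and the four-point example are hypothetical, and every value of $Y$ is assumed to have positive probability.

```python
def conditional_expectation(x, y, prob):
    """E(X | Y) on a finite sample space, returned as one value per outcome.

    x, y and prob list the values X(w), Y(w) and P({w}) over the outcomes w.
    """
    total_p, total_xp = {}, {}
    for xi, yi, pi in zip(x, y, prob):
        total_p[yi] = total_p.get(yi, 0.0) + pi
        total_xp[yi] = total_xp.get(yi, 0.0) + xi * pi
    # The result is constant on each level set {Y = y}, hence sigma(Y)-measurable.
    return [total_xp[yi] / total_p[yi] for yi in y]


def expectation_on_event(values, y, prob, label):
    """E(V 1_A) for the event A = {Y = label}."""
    return sum(v * p for v, yi, p in zip(values, y, prob) if yi == label)


# Hypothetical four-point sample space.
x = [1.0, 2.0, 3.0, 4.0]
y = ["a", "a", "b", "b"]
prob = [0.1, 0.2, 0.3, 0.4]
x_prime = conditional_expectation(x, y, prob)
```

Because every set in $\sigma(Y)$ is a union of level sets, checking the identity on the events $\{Y = y\}$ suffices in this finite setting.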

Definition 2.18 does show that a conditional expectation has to do with expecta- 
tions, but is not very intuitive. Some examples help to gain more insight in conditional 
expectations. 

2.19 Example (Ordinary expectation). The expectation $\mathrm{E}X$ of a random variable $X$
is a number, and as such can be viewed as a degenerate random variable. It is also the
conditional expectation relative to the trivial $\sigma$-field $\mathcal{U}_0 = \{\emptyset, \Omega\}$. More generally, we have
that $\mathrm{E}(X \mid \mathcal{U}_0) = \mathrm{E}X$ if $X$ and $\mathcal{U}_0$ are independent. In this case $\mathcal{U}_0$ gives "no information"
about $X$ and hence the expectation given $\mathcal{U}_0$ is the "unconditional" expectation.

To see this note that $\mathrm{E}(\mathrm{E}X)1_F = \mathrm{E}X\, \mathrm{E}1_F = \mathrm{E}X1_F$ for every measurable set $F$ such
that $X$ and $F$ are independent. □

2.20 Example. At the other extreme we have that $\mathrm{E}(X \mid \mathcal{U}_0) = X$ if $X$ itself is
$\mathcal{U}_0$-measurable. This is immediate from the definition. "Given $\mathcal{U}_0$ we then know $X$ exactly."
□

A measurable map $Y: \Omega \to D$ with values in some measurable space $(D, \mathcal{D})$ generates
a $\sigma$-field $\sigma(Y)$. The notation $\mathrm{E}(X \mid Y)$ is an abbreviation of $\mathrm{E}(X \mid \sigma(Y))$.

2.21 Example. Let $(X, Y): \Omega \to \mathbb{R} \times \mathbb{R}^k$ be measurable and possess a density $f(x, y)$
relative to a $\sigma$-finite product measure $\mu \times \nu$ on $\mathbb{R} \times \mathbb{R}^k$ (for instance, the Lebesgue measure
on $\mathbb{R}^{k+1}$). Then it is customary to define a conditional density of $X$ given $Y = y$ by

$$f(x \mid y) = \frac{f(x, y)}{\int f(x, y)\, d\mu(x)}.$$

This is well-defined for every $y$ for which the denominator is positive, i.e. for all $y$ in a
set of measure one under the distribution of $Y$.

We now have that the conditional expectation is given by the "usual formula"

$$\mathrm{E}(X \mid Y) = \int x f(x \mid Y)\, d\mu(x),$$

where we may define the right-hand side as zero if the expression is not well-defined.

That this formula is the conditional expectation according to Definition 2.18 follows
by a number of applications of Fubini's theorem. Note that, to begin with, it is a part of
the statement of Fubini's theorem that the function on the right is a measurable function
of $y$. □

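2.22 Lemma (Properties).

(i) $\mathrm{E}\mathrm{E}(X \mid \mathcal{U}_0) = \mathrm{E}X$.
(ii) If $Z$ is $\mathcal{U}_0$-measurable, then $\mathrm{E}(ZX \mid \mathcal{U}_0) = Z\mathrm{E}(X \mid \mathcal{U}_0)$ a.s. (Here require that
$X \in L_p(\Omega, \mathcal{U}, \mathrm{P})$ and $Z \in L_q(\Omega, \mathcal{U}, \mathrm{P})$ for $1 \le p \le \infty$ and $p^{-1} + q^{-1} = 1$.)
(iii) (linearity) $\mathrm{E}(\alpha X + \beta Y \mid \mathcal{U}_0) = \alpha \mathrm{E}(X \mid \mathcal{U}_0) + \beta \mathrm{E}(Y \mid \mathcal{U}_0)$ a.s.
(iv) (positivity) If $X \ge 0$ a.s., then $\mathrm{E}(X \mid \mathcal{U}_0) \ge 0$ a.s.
(v) (towering property) If $\mathcal{U}_0 \subset \mathcal{U}_1 \subset \mathcal{U}$, then $\mathrm{E}\bigl(\mathrm{E}(X \mid \mathcal{U}_1) \mid \mathcal{U}_0\bigr) = \mathrm{E}(X \mid \mathcal{U}_0)$ a.s.

The conditional expectation $\mathrm{E}(X \mid Y)$ given a random vector $Y$ is by definition a
$\sigma(Y)$-measurable function. For most $Y$, this means that it is a measurable function $g(Y)$
of $Y$. (See the following lemma.) The value $g(y)$ is often denoted by $\mathrm{E}(X \mid Y = y)$.

Warning. Unless $\mathrm{P}(Y = y) > 0$ it is not right to give a meaning to $\mathrm{E}(X \mid Y = y)$ for
a fixed, single $y$, even though the interpretation as an expectation given "that we know
that $Y = y$" often makes this tempting. We may only think of a conditional expectation
as a function $y \mapsto \mathrm{E}(X \mid Y = y)$ and this is only determined up to null sets.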

2.23 Lemma. Let $\{Y_\alpha: \alpha \in A\}$ be random variables on $\Omega$ and let $X$ be a $\sigma(Y_\alpha: \alpha \in A)$-
measurable random variable.

(i) If $A = \{1, 2, \ldots, k\}$, then there exists a measurable map $g: \mathbb{R}^k \to \mathbb{R}$ such that
$X = g(Y_1, \ldots, Y_k)$.
(ii) If $|A| = \infty$, then there exists a countable subset $\{\alpha_n\}_{n=1}^\infty \subset A$ and a measurable
map $g: \mathbb{R}^\infty \to \mathbb{R}$ such that $X = g(Y_{\alpha_1}, Y_{\alpha_2}, \ldots)$.



2.3 Linear Prediction 

Suppose that we observe the values $X_1, \ldots, X_n$ from a stationary, mean zero time series
$X_t$.

2.24 Definition. Suppose that $\mathrm{E}X_t = 0$. The best linear predictor of $X_{n+1}$ is the linear
combination $\phi_1 X_n + \phi_2 X_{n-1} + \cdots + \phi_n X_1$ that minimizes $\mathrm{E}|X_{n+1} - Y|^2$ over all linear
combinations $Y$ of $X_1, \ldots, X_n$. The minimal value $\mathrm{E}|X_{n+1} - \phi_1 X_n - \cdots - \phi_n X_1|^2$ is
called the square prediction error.

In the terminology of the preceding section, the best linear predictor of $X_{n+1}$ is the
projection of $X_{n+1}$ onto the linear subspace $\operatorname{lin}(X_1, \ldots, X_n)$ spanned by $X_1, \ldots, X_n$. A
common notation is $\Pi_n X_{n+1}$, for $\Pi_n$ the projection onto $\operatorname{lin}(X_1, \ldots, X_n)$. Best linear
predictors of other random variables are defined similarly.

Warning. The coefficients $\phi_1, \ldots, \phi_n$ in the formula $\Pi_n X_{n+1} = \phi_1 X_n + \cdots + \phi_n X_1$
depend on $n$, even though we shall often suppress this dependence in the notation.

By Theorem 2.10 the best linear predictor can be found from the prediction equations

$$\langle X_{n+1} - \phi_1 X_n - \cdots - \phi_n X_1, X_t \rangle = 0, \qquad t = 1, \ldots, n.$$

For a stationary time series $X_t$ this system can be written in the form

$$\begin{pmatrix} \gamma_X(0) & \gamma_X(1) & \cdots & \gamma_X(n-1) \\ \gamma_X(1) & \gamma_X(0) & \cdots & \gamma_X(n-2) \\ \vdots & \vdots & \ddots & \vdots \\ \gamma_X(n-1) & \gamma_X(n-2) & \cdots & \gamma_X(0) \end{pmatrix} \begin{pmatrix} \phi_1 \\ \vdots \\ \phi_n \end{pmatrix} = \begin{pmatrix} \gamma_X(1) \\ \vdots \\ \gamma_X(n) \end{pmatrix}. \tag{2.1}$$

If the $(n \times n)$-matrix on the left is nonsingular, then $\phi_1, \ldots, \phi_n$ can be solved uniquely.
Otherwise there are more solutions for the vector $(\phi_1, \ldots, \phi_n)$, but any solution will
give the best linear predictor $\Pi_n X_{n+1} = \phi_1 X_n + \cdots + \phi_n X_1$. The equations express
$\phi_1, \ldots, \phi_n$ in the auto-covariance function $\gamma_X$. In practice, we do not know this function,
but estimate it from the data. (We consider this estimation problem later on.) Then we
use the corresponding estimates for $\phi_1, \ldots, \phi_n$ to calculate the predictor.

The prediction error can be expressed in the coefficients using Pythagoras' rule,
which gives, for a stationary time series $X_t$,

$$\mathrm{E}|X_{n+1} - \Pi_n X_{n+1}|^2 = \mathrm{E}|X_{n+1}|^2 - \mathrm{E}|\Pi_n X_{n+1}|^2
= \gamma_X(0) - (\phi_1, \ldots, \phi_n)\, \Gamma_n\, (\phi_1, \ldots, \phi_n)^T,$$

for $\Gamma_n$ the covariance matrix of the vector $(X_1, \ldots, X_n)$, i.e. the matrix on the left
side of (2.1).
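
For a known auto-covariance function the prediction equations (2.1) form an ordinary linear system, and the square prediction error follows from the quadratic form above. The Python sketch below solves the system by plain Gaussian elimination; it is an illustration only, and it assumes that the matrix $\Gamma_n$ is nonsingular. The helper names are hypothetical, and the example uses the standard auto-covariance $\gamma_X(h) = \sigma^2 \phi^{|h|} / (1 - \phi^2)$ of a stationary first-order auto-regression with $|\phi| < 1$, a formula that is assumed here rather than derived.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def best_linear_predictor(gamma, n):
    """Coefficients (phi_1, ..., phi_n) from (2.1) and the square prediction error."""
    big_gamma = [[gamma(abs(i - j)) for j in range(n)] for i in range(n)]
    rhs = [gamma(h) for h in range(1, n + 1)]
    phi = solve(big_gamma, rhs)
    quad = sum(phi[i] * big_gamma[i][j] * phi[j] for i in range(n) for j in range(n))
    return phi, gamma(0) - quad


def ar1_autocov(phi, sigma2):
    """Assumed auto-covariance of a stationary AR(1) series with |phi| < 1."""
    return lambda h: sigma2 * phi ** abs(h) / (1.0 - phi ** 2)


# Hypothetical example: AR(1) with phi = 0.6, unit noise variance, n = 4 observations.
coefficients, error = best_linear_predictor(ar1_autocov(0.6, 1.0), 4)
```

In this example only the coefficient of $X_n$ is nonzero and the error equals the noise variance, up to rounding. In practice `gamma` would be replaced by an estimate of the auto-covariance function.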

Similar arguments apply to predicting $X_{n+h}$ for $h > 1$. If we wish to predict the
future values at many time lags $h = 1, 2, \ldots$, then solving an $n$-dimensional linear
system for every $h$ separately can be computer-intensive, as $n$ may be large. Several more
efficient, recursive algorithms use the predictions at earlier times to calculate the next
prediction. We omit a discussion.

2.25 Example (Autoregression). Prediction is extremely simple for the stationary
auto-regressive time series satisfying $X_t = \phi X_{t-1} + Z_t$ for a white noise sequence $Z_t$ and
$|\phi| < 1$. The best linear predictor of $X_{n+1}$ given $X_1, \ldots, X_n$ is simply $\phi X_n$ (for $n \ge 1$).
Thus we predict $X_{n+1} = \phi X_n + Z_{n+1}$ by simply setting the unknown $Z_{n+1}$ equal to its
mean, zero. The interpretation is that the $Z_t$ are external noise factors that are completely
unpredictable based on the past. The square prediction error $\mathrm{E}|X_{n+1} - \phi X_n|^2 = \mathrm{E}Z_{n+1}^2$
is equal to the variance of this noise variable.

The claim is not obvious, as is proved by the fact that it is wrong in the case that
$|\phi| > 1$. To prove the claim we recall from Example 1.8 that the unique stationary solution
to the auto-regressive equation in the case that $|\phi| < 1$ is given by $X_t = \sum_{j=0}^\infty \phi^j Z_{t-j}$.
Thus $X_t$ depends only on $Z_s$ from the past and the present. Because $Z_t$ is a white noise
sequence, it follows that $X_t$ is uncorrelated with the variables $Z_{t+1}, Z_{t+2}, \ldots$. Therefore
$\langle X_{n+1} - \phi X_n, X_t \rangle = \langle Z_{n+1}, X_t \rangle = 0$ for $t = 1, 2, \ldots, n$. This verifies the orthogonality
relationship; it is obvious that $\phi X_n$ is contained in the linear span of $X_1, \ldots, X_n$. □

2.26 EXERCISE. There is a hidden use of the continuity of the inner product in the 
preceding example. Can you see where? 

2.27 Example (Deterministic trigonometric series). For the process $X_t = A\cos(\lambda t) +
B\sin(\lambda t)$, considered in Example 1.5, the best linear predictor of $X_{n+1}$ given $X_1, \ldots, X_n$
is given by $2(\cos\lambda)X_n - X_{n-1}$, for $n \ge 2$. The prediction error is equal to 0! This
underscores that this type of time series is deterministic in character: if we know it at
two time instants, then we know the time series at all other time instants. The explanation
is that from the values of $X_t$ at two time instants we can recover the values $A$ and $B$.

These assertions follow by explicit calculations, solving the prediction equations. It
suffices to do this for $n = 2$: if $X_3$ can be predicted without error by $2(\cos\lambda)X_2 - X_1$,
then, by stationarity, $X_{n+1}$ can be predicted without error by $2(\cos\lambda)X_n - X_{n-1}$. □
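
The zero prediction error reflects a pathwise identity: for every realization of $A$ and $B$ the sequence satisfies $X_{n+1} = 2(\cos\lambda)X_n - X_{n-1}$, which follows from $\cos(\lambda(n+1)) + \cos(\lambda(n-1)) = 2\cos\lambda\cos(\lambda n)$ and the analogous identity for the sine. The Python sketch below checks this numerically for one arbitrarily chosen realization; the function names and the values of $A$, $B$ and $\lambda$ are hypothetical, and the check does not replace the calculation with the prediction equations.

```python
import math


def trig_series(a, b, lam, t):
    """One realization X_t = a cos(lam t) + b sin(lam t) for fixed values a, b."""
    return a * math.cos(lam * t) + b * math.sin(lam * t)


def predict_next(lam, x_n, x_prev):
    """The predictor 2 cos(lam) X_n - X_{n-1} of X_{n+1}."""
    return 2.0 * math.cos(lam) * x_n - x_prev


# Hypothetical realized values of A and B and a hypothetical frequency lam.
a, b, lam = 1.3, -0.7, 0.9
errors = [
    abs(trig_series(a, b, lam, n + 1)
        - predict_next(lam, trig_series(a, b, lam, n), trig_series(a, b, lam, n - 1)))
    for n in range(2, 50)
]
# Every entry of errors is zero up to floating-point rounding.
```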

2.28 EXERCISE.

(i) Prove the assertions in the preceding example.
(ii) Are the coefficients $2\cos\lambda, -1, 0, \ldots, 0$ in this example unique?

If a given time series $X_t$ is not centered at 0, then it is natural to allow a constant
term in the predictor. Write 1 for the random variable that is equal to 1 almost surely.

2.29 Definition. The best linear predictor of $X_{n+1}$ based on $X_1, \ldots, X_n$ is the
projection of $X_{n+1}$ onto the linear space spanned by $1, X_1, \ldots, X_n$.

If the time series $X_t$ does have mean zero, then the introduction of the constant
term 1 does not help. Indeed, the relation $\mathrm{E}X_t = 0$ is equivalent to $X_t \perp 1$, which
implies both that $1 \perp \operatorname{lin}(X_1, \ldots, X_n)$ and that the projection of $X_{n+1}$ onto $\operatorname{lin} 1$ is
zero. By the orthogonality the projection of $X_{n+1}$ onto $\operatorname{lin}(1, X_1, \ldots, X_n)$ is the sum
of the projections of $X_{n+1}$ onto $\operatorname{lin} 1$ and $\operatorname{lin}(X_1, \ldots, X_n)$, which is the projection on
$\operatorname{lin}(X_1, \ldots, X_n)$, the first projection being 0.

By a similar argument we see that for a time series with mean $\mu = \mathrm{E}X_t$ possibly
nonzero,

$$\Pi_{\operatorname{lin}(1, X_1, \ldots, X_n)} X_{n+1} = \mu + \Pi_{\operatorname{lin}(X_1 - \mu, \ldots, X_n - \mu)}(X_{n+1} - \mu). \tag{2.3}$$

Thus the recipe for prediction with uncentered time series is: subtract the mean from
every $X_t$, calculate the projection for the centered time series $X_t - \mu$, and finally add the
mean. Because the auto-covariance function $\gamma_X$ gives the inner products of the centered
process, the coefficients $\phi_1, \ldots, \phi_n$ of $X_n - \mu, \ldots, X_1 - \mu$ are still given by the prediction
equations (2.1).

2.30 EXERCISE. Prove formula (2.3), noting that $\mathrm{E}X_t = \mu$ is equivalent to $X_t - \mu \perp 1$.



2.4 Nonlinear Prediction 

The method of linear prediction is commonly used in time series analysis. Its main
advantage is simplicity: the linear predictor depends on the mean and auto-covariance
function only, and in a simple fashion. On the other hand, if we allow general functions
$f(X_1, \ldots, X_n)$ of the observations as predictors, then we might be able to decrease the
prediction error.

2.31 Definition. The best predictor of $X_{n+1}$ based on $X_1, \ldots, X_n$ is the function
$f_n(X_1, \ldots, X_n)$ that minimizes $\mathrm{E}|X_{n+1} - f(X_1, \ldots, X_n)|^2$ over all measurable functions
$f: \mathbb{R}^n \to \mathbb{R}$.

In view of the discussion in Section 2.2.1 the best predictor is the conditional
expectation $\mathrm{E}(X_{n+1} \mid X_1, \ldots, X_n)$ of $X_{n+1}$ given the variables $X_1, \ldots, X_n$. Best predictors
of other variables are defined similarly as conditional expectations.

The difference between linear and nonlinear prediction can be substantial. In "clas- 
sical" time series theory linear models with Gaussian errors were predominant and for 
those models the two predictors coincide. Given nonlinear models, or non-Gaussian dis- 
tributions, nonlinear prediction should be the method of choice, if feasible. 

2.32 Example (GARCH). In the GARCH model of Example 1.10 the variable $X_{n+1}$ is
given as $\sigma_{n+1} Z_{n+1}$, where $\sigma_{n+1}$ is a function of $X_n, X_{n-1}, \ldots$ and $Z_{n+1}$ is independent
of these variables. It follows that the best predictor of $X_{n+1}$ given the infinite past
$X_n, X_{n-1}, \ldots$ is given by $\sigma_{n+1} \mathrm{E}(Z_{n+1} \mid X_n, X_{n-1}, \ldots) = 0$. We can find the best predictor
given $X_n, \ldots, X_1$ by projecting this predictor further onto the space of all measurable
functions of $X_n, \ldots, X_1$. By the linearity of the projection we again find 0.

We conclude that a GARCH model does not allow a "true prediction" of the future,
if "true" refers to predicting the values of the time series itself.

On the other hand, we can predict other quantities of interest. For instance, the
uncertainty of the value of $X_{n+1}$ is determined by the size of $\sigma_{n+1}$. If $\sigma_{n+1}$ is close to
zero, then we may expect $X_{n+1}$ to be close to zero, and conversely. Given the infinite
past $X_n, X_{n-1}, \ldots$ the variable $\sigma_{n+1}$ is known completely, but in the more realistic
situation that we know only $X_n, \ldots, X_1$ some chance component will be left. (For large $n$
the difference between these two situations will be small.) The dependence of $\sigma_{n+1}$ on
$X_n, X_{n-1}, \ldots$ is complicated and highly nonlinear. We shall discuss this later. □



2.5 Partial Auto-Correlation 

For a given mean-zero stationary time series $X_t$ the partial auto-correlation of lag $h$ is
defined as the correlation between $X_h - \Pi_{h-1} X_h$ and $X_0 - \Pi_{h-1} X_0$, where $\Pi_h$ is the
projection onto $\operatorname{lin}(X_1, \ldots, X_h)$. This is the "correlation between $X_h$ and $X_0$ with the
correlation due to the intermediate variables $X_1, \ldots, X_{h-1}$ removed". We shall denote it
by

$$\alpha_X(h) = \rho\bigl(X_h - \Pi_{h-1} X_h, X_0 - \Pi_{h-1} X_0\bigr).$$

For an uncentered stationary time series we set the partial auto-correlation by definition
equal to the partial auto-correlation of the centered series $X_t - \mathrm{E}X_t$. A convenient method
to compute $\alpha_X$ is given by the prediction equations combined with the following lemma:
$\alpha_X(h)$ is the coefficient of $X_1$ of the best linear predictor of $X_{h+1}$ based on $X_1, \ldots, X_h$.

2.33 Lemma. Suppose that $X_t$ is a mean-zero stationary time series. If $\phi_1 X_h + \phi_2 X_{h-1} +
\cdots + \phi_h X_1$ is the best linear predictor of $X_{h+1}$ based on $X_1, \ldots, X_h$, then $\alpha_X(h) = \phi_h$.

Proof. Let $\psi_1 X_h + \cdots + \psi_{h-1} X_2 =: \Pi_{2,h} X_1$ be the best linear predictor of $X_1$ based on
$X_2, \ldots, X_h$. The best linear predictor of $X_{h+1}$ based on $X_1, \ldots, X_h$ can be decomposed
as

$$\Pi_h X_{h+1} = \phi_1 X_h + \cdots + \phi_h X_1
= \bigl[(\phi_1 + \phi_h \psi_1) X_h + \cdots + (\phi_{h-1} + \phi_h \psi_{h-1}) X_2\bigr] + \bigl[\phi_h (X_1 - \Pi_{2,h} X_1)\bigr].$$

The two terms in square brackets are orthogonal, because $X_1 - \Pi_{2,h} X_1 \perp \operatorname{lin}(X_2, \ldots, X_h)$
by the projection theorem. Therefore, the second term in square brackets is the projection
of $\Pi_h X_{h+1}$ onto the one-dimensional subspace $\operatorname{lin}(X_1 - \Pi_{2,h} X_1)$. It is also the projection
of $X_{h+1}$ onto this one-dimensional subspace, because $\operatorname{lin}(X_1 - \Pi_{2,h} X_1) \subset \operatorname{lin}(X_1, \ldots, X_h)$
and we can compute projections by first projecting onto a bigger subspace.

The projection of $X_{h+1}$ onto the one-dimensional subspace $\operatorname{lin}(X_1 - \Pi_{2,h} X_1)$ is easy
to compute directly. It is given by $\alpha (X_1 - \Pi_{2,h} X_1)$ for $\alpha$ given by

$$\alpha = \frac{\langle X_{h+1}, X_1 - \Pi_{2,h} X_1 \rangle}{\|X_1 - \Pi_{2,h} X_1\|^2} = \frac{\langle X_{h+1} - \Pi_{2,h} X_{h+1}, X_1 - \Pi_{2,h} X_1 \rangle}{\|X_1 - \Pi_{2,h} X_1\|^2}.$$

Because the prediction problem is symmetric in time, as it depends on the auto-covariance
function only, $\|X_1 - \Pi_{2,h} X_1\| = \|X_{h+1} - \Pi_{2,h} X_{h+1}\|$. Therefore, the right side is exactly
$\alpha_X(h)$. In view of the preceding paragraph, we have $\alpha = \phi_h$ and the lemma is proved. ■

2.34 Example (Autoregression). According to Example 2.25, for the stationary
auto-regressive process $X_t = \phi X_{t-1} + Z_t$ with $|\phi| < 1$, the best linear predictor of $X_{n+1}$ based
on $X_1, \ldots, X_n$ is $\phi X_n$, for $n \ge 1$. Thus the partial auto-correlations $\alpha_X(h)$ of lags $h > 1$
are zero and $\alpha_X(1) = \phi$. This is often viewed as the dual of the property that for the
moving average sequence of order 1, considered in Example 1.6, the auto-correlations of
lags $h > 1$ vanish.

In Chapter 7 we shall see that for higher order stationary auto-regressive processes
$X_t = \phi_1 X_{t-1} + \cdots + \phi_p X_{t-p} + Z_t$ the partial auto-correlations of lags $h > p$ are zero
under the (standard) assumption that the time series is "causal". □
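
Lemma 2.33 turns the computation of partial auto-correlations into repeated solution of the prediction equations: $\alpha_X(h)$ is the last coefficient $\phi_h$ of the system with $n = h$. The Python sketch below does exactly this with plain Gaussian elimination. It is an illustration only, assuming a nonsingular covariance matrix; the helper names are hypothetical, and the two auto-covariance functions used (for a first-order auto-regression and a first-order moving average $X_t = Z_t + \theta Z_{t-1}$) are the standard formulas, assumed here rather than derived.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def partial_autocorrelation(gamma, h):
    """alpha_X(h): the coefficient phi_h of X_1 in the best linear predictor of X_{h+1}."""
    big_gamma = [[gamma(abs(i - j)) for j in range(h)] for i in range(h)]
    rhs = [gamma(k) for k in range(1, h + 1)]
    return solve(big_gamma, rhs)[-1]


def ar1_autocov(phi, sigma2=1.0):
    """Assumed auto-covariance of a stationary AR(1) series with |phi| < 1."""
    return lambda h: sigma2 * phi ** abs(h) / (1.0 - phi ** 2)


def ma1_autocov(theta, sigma2=1.0):
    """Assumed auto-covariance of the MA(1) series X_t = Z_t + theta Z_{t-1}."""
    def gamma(h):
        if h == 0:
            return sigma2 * (1.0 + theta ** 2)
        return sigma2 * theta if abs(h) == 1 else 0.0
    return gamma


# Hypothetical parameter values.
pacf_ar = [partial_autocorrelation(ar1_autocov(0.6), h) for h in range(1, 5)]
pacf_ma = [partial_autocorrelation(ma1_autocov(0.5), h) for h in range(1, 5)]
```

For the auto-regression the values after lag 1 vanish up to rounding, while for the moving average they do not, in line with the duality mentioned in Example 2.34.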



3 Stochastic Convergence



This chapter provides a review of modes of convergence of sequences of stochastic vectors,
in particular convergence in distribution and in probability. Many proofs are omitted,
but can be found in most standard probability books.



3.1 Basic theory 

A random vector in $\mathbb{R}^k$ is a vector $X = (X_1, \ldots, X_k)$ of real random variables. More
formally it is a Borel measurable map from some probability space in $\mathbb{R}^k$. The distribution
function of $X$ is the map $x \mapsto \mathrm{P}(X \le x)$.

A sequence of random vectors $X_n$ is said to converge in distribution to $X$ if

$$\mathrm{P}(X_n \le x) \to \mathrm{P}(X \le x),$$

for every $x$ at which the distribution function $x \mapsto \mathrm{P}(X \le x)$ is continuous. Alternative
names are weak convergence and convergence in law. As the last name suggests, the
convergence only depends on the induced laws of the vectors and not on the probability
spaces on which they are defined. Weak convergence is denoted by $X_n \rightsquigarrow X$; if $X$ has
distribution $L$ or a distribution with a standard code such as $N(0, 1)$, then also by
$X_n \rightsquigarrow L$ or $X_n \rightsquigarrow N(0, 1)$.

Let $d(x, y)$ be any distance function on $\mathbb{R}^k$ that generates the usual topology. For
instance

$$d(x, y) = \|x - y\| = \Bigl(\sum_{i=1}^k (x_i - y_i)^2\Bigr)^{1/2}.$$

A sequence of random variables $X_n$ is said to converge in probability to $X$ if for all $\varepsilon > 0$

$$\mathrm{P}\bigl(d(X_n, X) > \varepsilon\bigr) \to 0.$$

This is denoted by $X_n \xrightarrow{P} X$. In this notation convergence in probability is the same as
$d(X_n, X) \xrightarrow{P} 0$.

As we shall see convergence in probability is stronger than convergence in
distribution. Even stronger modes of convergence are almost sure convergence and convergence
in $p$th mean. The sequence $X_n$ is said to converge almost surely to $X$ if $d(X_n, X) \to 0$
with probability one:

$$\mathrm{P}\bigl(\lim d(X_n, X) = 0\bigr) = 1.$$

This is denoted by $X_n \xrightarrow{as} X$. The sequence $X_n$ is said to converge in $p$th mean to $X$ if

$$\mathrm{E}d(X_n, X)^p \to 0.$$

This is denoted $X_n \xrightarrow{L_p} X$. We already encountered the special cases $p = 1$ or $p = 2$,
which are referred to as "convergence in mean" and "convergence in quadratic mean".

Convergence in probability, almost surely, or in mean only make sense if each X n 
and X are defined on the same probability space. For convergence in distribution this is 
not necessary. 

The portmanteau lemma gives a number of equivalent descriptions of weak convergence.
Most of the characterizations are only useful in proofs. The last one also has
intuitive value.

3.1 Lemma (Portmanteau). For any random vectors $X_n$ and $X$ the following statements
are equivalent.

(i) $P(X_n \le x) \to P(X \le x)$ for all continuity points of $x \mapsto P(X \le x)$;
(ii) $Ef(X_n) \to Ef(X)$ for all bounded, continuous functions $f$;
(iii) $Ef(X_n) \to Ef(X)$ for all bounded, Lipschitz* functions $f$;
(iv) $\liminf P(X_n \in G) \ge P(X \in G)$ for every open set $G$;
(v) $\limsup P(X_n \in F) \le P(X \in F)$ for every closed set $F$;
(vi) $P(X_n \in B) \to P(X \in B)$ for all Borel sets $B$ with $P(X \in \delta B) = 0$, where $\delta B = \bar{B} - \mathring{B}$
is the boundary of $B$.

The continuous mapping theorem is a simple result, but is extremely useful. If the
sequence of random vectors $X_n$ converges to $X$ and $g$ is continuous, then $g(X_n)$ converges
to $g(X)$. This is true without further conditions for three of our four modes of stochastic
convergence.

3.2 Theorem (Continuous mapping). Let $g: \mathbb{R}^k \to \mathbb{R}^m$ be measurable and continuous
at every point of a set $C$ such that $P(X \in C) = 1$.

(i) If $X_n \rightsquigarrow X$, then $g(X_n) \rightsquigarrow g(X)$;
(ii) If $X_n \overset{P}{\to} X$, then $g(X_n) \overset{P}{\to} g(X)$;
(iii) If $X_n \overset{as}{\to} X$, then $g(X_n) \overset{as}{\to} g(X)$.

* A function is called Lipschitz if there exists a number $L$ such that $|f(x) - f(y)| \le L\,d(x, y)$ for every $x$
and $y$. The least such number $L$ is denoted $\|f\|_{\mathrm{Lip}}$.

Any random vector $X$ is tight: for every $\varepsilon > 0$ there exists a constant $M$ such that
$P(\|X\| > M) < \varepsilon$. A set of random vectors $\{X_\alpha: \alpha \in A\}$ is called uniformly tight if $M$
can be chosen the same for every $X_\alpha$: for every $\varepsilon > 0$ there exists a constant $M$ such
that

$$\sup_\alpha P(\|X_\alpha\| > M) < \varepsilon.$$

Thus there exists a compact set to which all $X_\alpha$ give probability almost one. Another
name for uniformly tight is bounded in probability. It is not hard to see that every weakly
converging sequence $X_n$ is uniformly tight. More surprisingly, the converse of this statement
is almost true: according to Prohorov's theorem every uniformly tight sequence
contains a weakly converging subsequence.

3.3 Theorem (Prohorov's theorem). Let $X_n$ be random vectors in $\mathbb{R}^k$.
(i) If $X_n \rightsquigarrow X$ for some $X$, then $\{X_n: n \in \mathbb{N}\}$ is uniformly tight;
(ii) If $X_n$ is uniformly tight, then there is a subsequence with $X_{n_j} \rightsquigarrow X$ as $j \to \infty$ for
some $X$.

3.4 Example. A sequence $X_n$ of random variables with $E|X_n| = O(1)$ is uniformly
tight. This follows since by Markov's inequality: $P(|X_n| > M) \le E|X_n|/M$. This can be
made arbitrarily small uniformly in $n$ by choosing sufficiently large $M$.

The first absolute moment could of course be replaced by any other absolute moment.

Since the second moment is the sum of the variance and the square of the mean an
alternative sufficient condition for uniform tightness is: $EX_n = O(1)$ and $\operatorname{var} X_n = O(1)$.
□

Consider some of the relationships between the modes of convergence. Convergence
in distribution is weaker than convergence in probability, which is in turn weaker
than almost sure convergence and convergence in $p$th mean.

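3.5 Theorem. Let $X_n$, $X$ and $Y_n$ be random vectors. Then
(i) $X_n \overset{as}{\to} X$ implies $X_n \overset{P}{\to} X$;
(ii) $X_n \overset{L_p}{\to} X$ implies $X_n \overset{P}{\to} X$;
(iii) $X_n \overset{P}{\to} X$ implies $X_n \rightsquigarrow X$;
(iv) $X_n \overset{P}{\to} c$ for a constant $c$ if and only if $X_n \rightsquigarrow c$;
(v) if $X_n \rightsquigarrow X$ and $d(X_n, Y_n) \overset{P}{\to} 0$, then $Y_n \rightsquigarrow X$;
(vi) if $X_n \rightsquigarrow X$ and $Y_n \overset{P}{\to} c$ for a constant $c$, then $(X_n, Y_n) \rightsquigarrow (X, c)$;
(vii) if $X_n \overset{P}{\to} X$ and $Y_n \overset{P}{\to} Y$, then $(X_n, Y_n) \overset{P}{\to} (X, Y)$.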

Proof. (i). The sequence of sets $A_n = \cup_{m \ge n} \{d(X_m, X) > \varepsilon\}$ is decreasing for every
$\varepsilon > 0$ and decreases to the empty set if $X_n(\omega) \to X(\omega)$ for every $\omega$. If $X_n \overset{as}{\to} X$, then
$P(d(X_n, X) > \varepsilon) \le P(A_n) \to 0$.

(ii). This is an immediate consequence of Markov's inequality, according to which
$P(d(X_n, X) > \varepsilon) \le \varepsilon^{-p} E\,d(X_n, X)^p$ for every $\varepsilon > 0$.

(v). For every bounded Lipschitz function $f$ and every $\varepsilon > 0$ we have

$$|Ef(X_n) - Ef(Y_n)| \le \varepsilon \|f\|_{\mathrm{Lip}}\, E 1\{d(X_n, Y_n) \le \varepsilon\} + 2\|f\|_\infty\, E 1\{d(X_n, Y_n) > \varepsilon\}.$$

The second term on the right converges to zero as $n \to \infty$. The first term can be made
arbitrarily small by choice of $\varepsilon$. Conclude that the sequences $Ef(X_n)$ and $Ef(Y_n)$ have
the same limit. The result follows from the portmanteau lemma.

(iii). Since $d(X_n, X) \overset{P}{\to} 0$ and trivially $X \rightsquigarrow X$ it follows that $X_n \rightsquigarrow X$ by (v).

(iv). The 'only if' part is a special case of (iii). For the converse let $\mathrm{ball}(c, \varepsilon)$ be the
open ball of radius $\varepsilon$ around $c$. Then $P(d(X_n, c) \ge \varepsilon) = P(X_n \in \mathrm{ball}(c, \varepsilon)^c)$. If $X_n \rightsquigarrow c$,
then the limsup of the last probability is bounded by $P(c \in \mathrm{ball}(c, \varepsilon)^c) = 0$.

(vi). First note that $d((X_n, Y_n), (X_n, c)) = d(Y_n, c) \overset{P}{\to} 0$. Thus according to (v) it
suffices to show that $(X_n, c) \rightsquigarrow (X, c)$. For every continuous, bounded function $(x, y) \mapsto
f(x, y)$, the function $x \mapsto f(x, c)$ is continuous and bounded. Thus $Ef(X_n, c) \to Ef(X, c)$
if $X_n \rightsquigarrow X$.

(vii). This follows from $d((x_1, y_1), (x_2, y_2)) \le d(x_1, x_2) + d(y_1, y_2)$. ■

According to the last assertion of the theorem convergence in probability of a sequence
of vectors $X_n = (X_{n,1}, \ldots, X_{n,k})$ is equivalent to convergence of every one of
the sequences of components $X_{n,i}$ separately. The analogous statement for convergence
in distribution is false: convergence in distribution of the sequence $X_n$ is stronger than
convergence of every one of the sequences of components $X_{n,i}$. The point is that the distribution
of the components $X_{n,i}$ separately does not determine their joint distribution:
they might be independent or dependent in many ways. One speaks of joint convergence
in distribution versus marginal convergence.

The one before last assertion of the theorem has some useful consequences. If $X_n \rightsquigarrow X$
and $Y_n \rightsquigarrow c$, then $(X_n, Y_n) \rightsquigarrow (X, c)$. Consequently, by the continuous mapping theorem
$g(X_n, Y_n) \rightsquigarrow g(X, c)$ for every map $g$ that is continuous at the set $\mathbb{R}^k \times \{c\}$ where the
vector $(X, c)$ takes its values. Thus for every $g$ such that

$$\lim_{x \to x_0,\, y \to c} g(x, y) = g(x_0, c), \qquad \text{every } x_0.$$

Some particular applications of this principle are known as Slutsky's lemma.

3.6 Lemma (Slutsky). Let $X_n$, $X$ and $Y_n$ be random vectors or variables. If $X_n \rightsquigarrow X$
and $Y_n \rightsquigarrow c$ for a constant $c$, then
(i) $X_n + Y_n \rightsquigarrow X + c$;
(ii) $Y_n X_n \rightsquigarrow cX$;
(iii) $X_n / Y_n \rightsquigarrow X/c$ provided $c \ne 0$.

In (i) the "constant" $c$ must be a vector of the same dimension as $X$, and in (ii) $c$ is
probably initially understood to be a scalar. However, (ii) is also true if every $Y_n$ and $c$
are matrices (which can be identified with vectors, for instance by aligning rows, to give
a meaning to the convergence $Y_n \rightsquigarrow c$), simply because matrix multiplication $(y, x) \mapsto yx$
is a continuous operation. Another true result in this case is that $X_n Y_n \rightsquigarrow Xc$, if this
statement is well-defined. Even (iii) is valid for matrices $Y_n$ and $c$ and vectors $X_n$ provided
$c \ne 0$ is understood as $c$ being invertible and division is interpreted as (pre)multiplication
by the inverse, because taking an inverse is also continuous.

3.7 Example. Let $T_n$ and $S_n$ be statistical estimators satisfying

$$\sqrt{n}(T_n - \theta) \rightsquigarrow N(0, \sigma^2), \qquad S_n^2 \overset{P}{\to} \sigma^2,$$

for certain parameters $\theta$ and $\sigma^2$ depending on the underlying distribution, for every
distribution in the model. Then $\theta = T_n \pm S_n/\sqrt{n}\,\xi_\alpha$ is a confidence interval for $\theta$ of
asymptotic level $1 - 2\alpha$, where $\xi_\alpha$ is the upper $\alpha$-quantile of the standard normal distribution.

This is a consequence of the fact that the sequence $\sqrt{n}(T_n - \theta)/S_n$ is asymptotically
standard normal distributed. □
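
The interval of Example 3.7 can be checked by simulation. The following is a minimal sketch, not part of the original notes: it assumes i.i.d. exponential(1) observations (so that $\theta = 1$), takes $T_n$ to be the sample mean and $S_n$ the sample standard deviation, and the function names `confidence_interval` and `coverage` are hypothetical. The fraction of intervals containing $\theta$ should be close to $1 - 2\alpha$ for large $n$.

```python
import math
import random
from statistics import NormalDist

def confidence_interval(sample, alpha):
    """Return (lower, upper) = T_n -/+ S_n / sqrt(n) * xi_alpha."""
    n = len(sample)
    t_n = math.fsum(sample) / n                          # estimator T_n of theta
    s_n = math.sqrt(math.fsum((x - t_n) ** 2 for x in sample) / (n - 1))
    xi = NormalDist().inv_cdf(1 - alpha)                 # upper alpha-quantile of N(0,1)
    half_width = s_n / math.sqrt(n) * xi
    return t_n - half_width, t_n + half_width

def coverage(n, alpha, reps, seed=0):
    """Fraction of simulated intervals that contain the true mean theta = 1."""
    rng = random.Random(seed)
    theta = 1.0                                          # mean of exponential(1)
    hits = 0
    for _ in range(reps):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        lower, upper = confidence_interval(sample, alpha)
        hits += lower <= theta <= upper
    return hits / reps

# With alpha = 0.025 the asymptotic level is 1 - 2 * alpha = 0.95.
print(coverage(n=400, alpha=0.025, reps=2000))
```

The simulated coverage fluctuates with the seed and is only approximately equal to the nominal level, because the level is asymptotic.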



3.2 Convergence of Moments 

By the portmanteau lemma, weak convergence $X_n \rightsquigarrow X$ implies that $Ef(X_n) \to Ef(X)$
for every continuous, bounded function $f$. The condition that $f$ be bounded is not superfluous:
it is not difficult to find examples of a sequence $X_n \rightsquigarrow X$ and an unbounded,
continuous function $f$ for which the convergence fails. In particular, in general convergence
in distribution does not imply convergence $EX_n^p \to EX^p$ of moments. However, in
many situations such convergence occurs, but it requires more effort to prove it.

A sequence of random variables $Y_n$ is called asymptotically uniformly integrable if

$$\lim_{M \to \infty} \limsup_{n \to \infty} E|Y_n| 1\{|Y_n| > M\} = 0.$$

A simple sufficient condition for this is that for some $p > 1$ the sequence $E|Y_n|^p$ is
bounded in $n$.

Uniform integrability is the missing link between convergence in distribution and
convergence of moments.

3.8 Theorem. Let $f: \mathbb{R}^k \to \mathbb{R}$ be measurable and continuous at every point in a set $C$.
Let $X_n \rightsquigarrow X$ where $X$ takes its values in $C$. Then $Ef(X_n) \to Ef(X)$ if and only if the
sequence of random variables $f(X_n)$ is asymptotically uniformly integrable.

3.9 Example. Suppose $X_n$ is a sequence of random variables such that $X_n \rightsquigarrow X$ and
$\limsup E|X_n|^p < \infty$ for some $p$. Then all moments of order strictly less than $p$ converge
also: $EX_n^k \to EX^k$ for every $k < p$.

By the preceding theorem, it suffices to prove that the sequence $X_n^k$ is asymptotically
uniformly integrable. By Markov's inequality

$$E|X_n|^k 1\{|X_n|^k > M\} \le M^{1 - p/k} E|X_n|^p.$$

The limsup, as $n \to \infty$ followed by $M \to \infty$, of the right side is zero if $k < p$. □



3.3 Arrays

Consider an infinite array $x_{n,l}$ of numbers, indexed by $(n, l) \in \mathbb{N} \times \mathbb{N}$, such that every
column has a limit, and the limits $x_l$ themselves converge to a limit $x$ as $l \to \infty$.

$$
\begin{array}{ccccc}
x_{1,1} & x_{1,2} & x_{1,3} & x_{1,4} & \cdots \\
x_{2,1} & x_{2,2} & x_{2,3} & x_{2,4} & \cdots \\
x_{3,1} & x_{3,2} & x_{3,3} & x_{3,4} & \cdots \\
\vdots & \vdots & \vdots & \vdots & \\
\downarrow & \downarrow & \downarrow & \downarrow & \cdots \\
x_1 & x_2 & x_3 & x_4 & \cdots \;\to\; x
\end{array}
$$

Then we can find a path $x_{n,l_n}$, indexed by $n \in \mathbb{N}$, through the array along which $x_{n,l_n} \to x$.
(The point is to move to the right slowly in the array while going down, i.e. $l_n \to \infty$.) A
similar property is valid for sequences of random vectors, where the convergence is taken
as convergence in distribution.
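
A small numerical illustration of such a path may help. The array below is a hypothetical example chosen for this sketch (it does not come from the notes): $x_{n,l} = 1/l + l/n$, whose column limits are $x_l = 1/l$, with $x = 0$. Moving to the right slowly, with $l_n = \lfloor\sqrt{n}\rfloor$, gives convergence to $x$; moving along the diagonal $l_n = n$ is too fast and does not.

```python
import math

def x(n, l):
    """Hypothetical array entry: column l has limit 1/l as n -> infinity."""
    return 1.0 / l + l / n

def slow_path(n):
    """Column index l_n = floor(sqrt(n)), which tends to infinity slowly."""
    return max(1, math.isqrt(n))

for n in (10, 1_000, 100_000, 10_000_000):
    along_path = x(n, slow_path(n))   # roughly 2 / sqrt(n), tends to x = 0
    diagonal = x(n, n)                # equals 1/n + 1, tends to 1, not to x
    print(n, along_path, diagonal)
```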

3.10 Lemma. For $n, l \in \mathbb{N}$ let $X_{n,l}$ be random vectors such that $X_{n,l} \rightsquigarrow X_l$ as $n \to \infty$
for every fixed $l$, for random vectors $X_l$ such that $X_l \rightsquigarrow X$ as $l \to \infty$. Then there exists a
sequence $l_n \to \infty$ such that $X_{n,l_n} \rightsquigarrow X$ as $n \to \infty$.

Proof. Let $D = \{d_1, d_2, \ldots\}$ be a countable set that is dense in $\mathbb{R}^k$ and that only contains
points at which the distribution functions of the limits $X, X_1, X_2, \ldots$ are continuous.
Then an arbitrary sequence of random variables $Y_n$ converges in distribution to one of
the variables $Y \in \{X, X_1, X_2, \ldots\}$ if and only if $P(Y_n \le d_i) \to P(Y \le d_i)$ for every
$d_i \in D$. We can prove this using the monotonicity and right-continuity of distribution
functions. In turn $P(Y_n \le d_i) \to P(Y \le d_i)$ for every $d_i \in D$ if and only if

$$\sum_{i=1}^{\infty} \bigl|P(Y_n \le d_i) - P(Y \le d_i)\bigr| 2^{-i} \to 0.$$

Now define

$$p_{n,l} = \sum_{i=1}^{\infty} \bigl|P(X_{n,l} \le d_i) - P(X_l \le d_i)\bigr| 2^{-i},$$

$$p_l = \sum_{i=1}^{\infty} \bigl|P(X_l \le d_i) - P(X \le d_i)\bigr| 2^{-i}.$$

The assumptions entail that $p_{n,l} \to 0$ as $n \to \infty$ for every fixed $l$, and that $p_l \to 0$ as
$l \to \infty$. This implies that there exists a sequence $l_n \to \infty$ such that $p_{n,l_n} \to 0$. By the
triangle inequality

$$\sum_{i=1}^{\infty} \bigl|P(X_{n,l_n} \le d_i) - P(X \le d_i)\bigr| 2^{-i} \le p_{n,l_n} + p_{l_n} \to 0.$$

This implies that $X_{n,l_n} \rightsquigarrow X$ as $n \to \infty$. ■



3.4 Stochastic o and O symbols

It is convenient to have short expressions for terms that converge in probability to zero
or are uniformly tight. The notation $o_P(1)$ ('small "oh-P-one"') is short for a sequence
of random vectors that converges to zero in probability. The expression $O_P(1)$ ('big "oh-P-one"')
denotes a sequence that is bounded in probability. More generally, for a given
sequence of random variables $R_n$

$X_n = o_P(R_n)$ means $X_n = Y_n R_n$ and $Y_n \overset{P}{\to} 0$;
$X_n = O_P(R_n)$ means $X_n = Y_n R_n$ and $Y_n = O_P(1)$.

This expresses that the sequence $X_n$ converges in probability to zero or is bounded
in probability at 'rate' $R_n$. For deterministic sequences $X_n$ and $R_n$ the stochastic oh-symbols
reduce to the usual $o$ and $O$ from calculus.

There are many rules of calculus with $o$ and $O$ symbols, which will be applied
without comment. For instance,

$o_P(1) + o_P(1) = o_P(1)$
$o_P(1) + O_P(1) = O_P(1)$
$O_P(1)\,o_P(1) = o_P(1)$
$(1 + o_P(1))^{-1} = O_P(1)$
$o_P(R_n) = R_n\,o_P(1)$
$O_P(R_n) = R_n\,O_P(1)$
$o_P(O_P(1)) = o_P(1)$.

To see the validity of these 'rules' it suffices to restate them in terms of explicitly named
vectors, where each $o_P(1)$ and $O_P(1)$ should be replaced by a different sequence of
vectors that converges to zero or is bounded in probability. In this manner the first
rule says: if $X_n \overset{P}{\to} 0$ and $Y_n \overset{P}{\to} 0$, then $Z_n = X_n + Y_n \overset{P}{\to} 0$; this is an example of the
continuous mapping theorem. The third rule is short for: if $X_n$ is bounded in probability
and $Y_n \overset{P}{\to} 0$, then $X_n Y_n \overset{P}{\to} 0$. If $X_n$ also converged in distribution, then this
would be statement (ii) of Slutsky's lemma (with $c = 0$). But by Prohorov's theorem $X_n$
converges in distribution 'along subsequences' if it is bounded in probability, so that the
third rule can still be deduced from Slutsky's lemma by 'arguing along subsequences'.

Note that both rules are in fact implications and should be read from left to right,
even though they are stated with the help of the equality "=" sign. Similarly, while it is
true that $o_P(1) + o_P(1) = 2o_P(1)$, writing down this rule does not reflect understanding
of the $o_P$-symbol.

Two more complicated rules are given by the following lemma. 

3.11 Lemma. Let $R$ be a function defined on a neighbourhood of $0 \in \mathbb{R}^k$ such that
$R(0) = 0$. Let $X_n$ be a sequence of random vectors that converges in probability to zero.
(i) if $R(h) = o(\|h\|)$ as $h \to 0$, then $R(X_n) = o_P(\|X_n\|)$;
(ii) if $R(h) = O(\|h\|)$ as $h \to 0$, then $R(X_n) = O_P(\|X_n\|)$.

Proof. Define $g(h)$ as $g(h) = R(h)/\|h\|$ for $h \ne 0$ and $g(0) = 0$. Then $R(X_n) =
g(X_n)\|X_n\|$.

(i). Since the function $g$ is continuous at zero by assumption, $g(X_n) \overset{P}{\to} g(0) = 0$ by
the continuous mapping theorem.

(ii). By assumption there exist $M$ and $\delta > 0$ such that $|g(h)| \le M$ whenever $\|h\| \le \delta$.
Thus $P(|g(X_n)| > M) \le P(\|X_n\| > \delta) \to 0$, and the sequence $g(X_n)$ is tight. ■

It should be noted that the rule expressed by the lemma is not a simple plug-in rule.
For instance it is not true that $R(h) = o(\|h\|)$ implies that $R(X_n) = o_P(\|X_n\|)$ for every
sequence of random vectors $X_n$.



3.5 Cramér-Wold Device

It is sometimes possible to show convergence in distribution of a sequence of random vectors
directly from the definition. In other cases 'transforms' of probability measures may
help. The basic idea is that it suffices to show characterization (ii) of the portmanteau
lemma for a small subset of functions $f$ only.

The most important transform is the characteristic function

$$t \mapsto Ee^{it^T X}, \qquad t \in \mathbb{R}^k.$$

Each of the functions $x \mapsto e^{it^T x}$ is continuous and bounded. Thus by the portmanteau
lemma $Ee^{it^T X_n} \to Ee^{it^T X}$ for every $t$ if $X_n \rightsquigarrow X$. By Lévy's continuity theorem the
converse is also true: pointwise convergence of characteristic functions is equivalent to
weak convergence.

3.12 Theorem (Lévy's continuity theorem). Let $X_n$ and $X$ be random vectors in
$\mathbb{R}^k$. Then $X_n \rightsquigarrow X$ if and only if $Ee^{it^T X_n} \to Ee^{it^T X}$ for every $t \in \mathbb{R}^k$. Moreover, if
$Ee^{it^T X_n}$ converges pointwise to a function $\phi(t)$ that is continuous at zero, then $\phi$ is the
characteristic function of a random vector $X$ and $X_n \rightsquigarrow X$.

The characteristic function $t \mapsto Ee^{it^T X}$ of a vector $X$ is determined by the set of all
characteristic functions $u \mapsto Ee^{iu(t^T X)}$ of all linear combinations $t^T X$ of the components
of $X$. Therefore the continuity theorem implies that weak convergence of vectors is
equivalent to weak convergence of linear combinations:

$$X_n \rightsquigarrow X \quad \text{if and only if} \quad t^T X_n \rightsquigarrow t^T X \text{ for all } t \in \mathbb{R}^k.$$

This is known as the Cramér-Wold device. It allows us to reduce all higher dimensional
problems to the one-dimensional case.

3.13 Example (Multivariate central limit theorem). Let $Y, Y_1, Y_2, \ldots$ be i.i.d. random
vectors in $\mathbb{R}^k$ with mean vector $\mu = EY$ and covariance matrix $\Sigma = E(Y - \mu)(Y - \mu)^T$.
Then

$$\frac{1}{\sqrt{n}} \sum_{i=1}^{n} (Y_i - \mu) = \sqrt{n}(\bar{Y}_n - \mu) \rightsquigarrow N_k(0, \Sigma).$$

(The sum is taken coordinatewise.) By the Cramér-Wold device the problem can be
reduced to finding the limit distribution of the sequences of real variables

$$t^T \Bigl(\frac{1}{\sqrt{n}} \sum_{i=1}^{n} (Y_i - \mu)\Bigr) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} (t^T Y_i - t^T \mu).$$

Since the random variables $t^T Y_1 - t^T \mu, t^T Y_2 - t^T \mu, \ldots$ are i.i.d. with zero mean and
variance $t^T \Sigma t$ this sequence is asymptotically $N_1(0, t^T \Sigma t)$ distributed by the univariate
central limit theorem. This is exactly the distribution of $t^T X$ if $X$ possesses a $N_k(0, \Sigma)$
distribution. □



3.6 Delta-method 

Let $T_n$ be a sequence of random vectors with values in $\mathbb{R}^k$ and let $\phi: \mathbb{R}^k \to \mathbb{R}^m$ be a
given function defined at least on the range of $T_n$ and a neighbourhood of a vector $\theta$.
We shall assume that, for given constants $r_n \to \infty$, the sequence $r_n(T_n - \theta)$ converges in
distribution, and wish to derive a similar result concerning the sequence $r_n(\phi(T_n) - \phi(\theta))$.
Recall that $\phi$ is differentiable at $\theta$ if there exists a linear map (matrix) $\phi'_\theta: \mathbb{R}^k \to \mathbb{R}^m$
such that

$$\phi(\theta + h) - \phi(\theta) = \phi'_\theta(h) + o(\|h\|), \qquad h \to 0.$$

All the expressions in this equation are vectors of length $m$ and $\|h\|$ is the Euclidean
norm. The linear map $h \mapsto \phi'_\theta(h)$ is sometimes called a total derivative, as opposed to
partial derivatives. A sufficient condition for $\phi$ to be (totally) differentiable is that all
partial derivatives $\partial\phi_j(x)/\partial x_i$ exist for $x$ in a neighbourhood of $\theta$ and are continuous
at $\theta$. (Just existence of the partial derivatives is not enough.) In any case the total
derivative is found from the partial derivatives. If $\phi$ is differentiable, then it is partially
differentiable and the derivative map $h \mapsto \phi'_\theta(h)$ is matrix multiplication by the matrix

$$\phi'_\theta = \begin{pmatrix}
\frac{\partial \phi_1}{\partial x_1}(\theta) & \cdots & \frac{\partial \phi_1}{\partial x_k}(\theta) \\
\vdots & & \vdots \\
\frac{\partial \phi_m}{\partial x_1}(\theta) & \cdots & \frac{\partial \phi_m}{\partial x_k}(\theta)
\end{pmatrix}.$$

If the dependence of the derivative $\phi'_\theta$ on $\theta$ is continuous, then $\phi$ is called continuously
differentiable.

3.14 Theorem. Let $\phi: \mathbb{R}^k \to \mathbb{R}^m$ be a measurable map defined on a subset of $\mathbb{R}^k$ and
differentiable at $\theta$. Let $T_n$ be random vectors taking their values in the domain of $\phi$. If
$r_n(T_n - \theta) \rightsquigarrow T$ for numbers $r_n \to \infty$, then $r_n(\phi(T_n) - \phi(\theta)) \rightsquigarrow \phi'_\theta(T)$. Moreover, the
difference between $r_n(\phi(T_n) - \phi(\theta))$ and $\phi'_\theta(r_n(T_n - \theta))$ converges to zero in probability.

Proof. Because $r_n \to \infty$, we have by Slutsky's lemma $T_n - \theta = (1/r_n)\,r_n(T_n - \theta) \rightsquigarrow
0 \cdot T = 0$ and hence $T_n - \theta$ converges to zero in probability. Define a function $g$ by

$$g(0) = 0, \qquad g(h) = \frac{\phi(\theta + h) - \phi(\theta) - \phi'_\theta(h)}{\|h\|}, \quad \text{if } h \ne 0.$$

Then $g$ is continuous at 0 by the differentiability of $\phi$. Therefore, by the continuous
mapping theorem, $g(T_n - \theta) \overset{P}{\to} 0$ and hence, by Slutsky's lemma and again the continuous
mapping theorem, $r_n\|T_n - \theta\|\,g(T_n - \theta) \rightsquigarrow \|T\| \cdot 0 = 0$. Consequently,

$$r_n\bigl(\phi(T_n) - \phi(\theta) - \phi'_\theta(T_n - \theta)\bigr) = r_n\|T_n - \theta\|\,g(T_n - \theta) \overset{P}{\to} 0.$$

This yields the last statement of the theorem. Since matrix multiplication is continuous,
$\phi'_\theta(r_n(T_n - \theta)) \rightsquigarrow \phi'_\theta(T)$ by the continuous-mapping theorem. Finally, apply Slutsky's
lemma to conclude that the sequence $r_n(\phi(T_n) - \phi(\theta))$ has the same weak limit. ■

A common situation is that $\sqrt{n}(T_n - \theta)$ converges to a multivariate normal distribution
$N_k(\mu, \Sigma)$. Then the conclusion of the theorem is that the sequence $\sqrt{n}(\phi(T_n) - \phi(\theta))$
converges in law to the $N_m(\phi'_\theta \mu, \phi'_\theta \Sigma (\phi'_\theta)^T)$ distribution.
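
The normal case can be illustrated by a small simulation. This is only a sketch under assumptions that are not in the notes: the observations are i.i.d. exponential(1), $T_n$ is their mean (so $\theta = 1$, $\mu = 0$, $\Sigma = \sigma^2 = 1$), and $\phi(x) = x^2$, for which $\phi'_\theta = 2\theta = 2$. The Delta-method then predicts that $\sqrt{n}(\phi(T_n) - \phi(\theta))$ is approximately $N(0, 4)$. The function name `delta_method_draws` is hypothetical.

```python
import math
import random
from statistics import fmean, pvariance

def delta_method_draws(n, reps, seed=0):
    """Simulate sqrt(n) * (phi(T_n) - phi(theta)) for phi(x) = x**2.

    T_n is the mean of n exponential(1) variables, so theta = 1 and sigma^2 = 1.
    """
    rng = random.Random(seed)
    theta = 1.0
    draws = []
    for _ in range(reps):
        t_n = fmean(rng.expovariate(1.0) for _ in range(n))
        draws.append(math.sqrt(n) * (t_n ** 2 - theta ** 2))
    return draws

draws = delta_method_draws(n=500, reps=2000)
# Delta-method prediction: mean 0 and variance phi'(theta)^2 * sigma^2 = 4.
print(fmean(draws), pvariance(draws))
```

The empirical mean and variance are close to, but not exactly equal to, 0 and 4, since the statement is asymptotic and the simulation is finite.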



3.7 Lindeberg Central Limit Theorem 

In this section we state, for later reference, a central limit theorem for independent, but 
not necessarily identically distributed random vectors. 

3.15 Theorem (Lindeberg). For each $n \in \mathbb{N}$ let $Y_{n,1}, \ldots, Y_{n,n}$ be independent random
vectors with finite covariance matrices such that

$$\frac{1}{n} \sum_{i=1}^{n} \operatorname{Cov} Y_{n,i} \to \Sigma,$$

$$\frac{1}{n} \sum_{i=1}^{n} E\|Y_{n,i}\|^2 1\{\|Y_{n,i}\| > \varepsilon\sqrt{n}\} \to 0, \qquad \text{for every } \varepsilon > 0.$$

Then the sequence $n^{-1/2} \sum_{i=1}^{n} (Y_{n,i} - EY_{n,i})$ converges in distribution to the normal
distribution with mean zero and covariance matrix $\Sigma$.



4 

Central Limit Theorem 



The classical central limit theorem asserts that the mean of independent, identically 
distributed random variables with finite variance is asymptotically normally distributed. 
In this chapter we extend this to certain dependent and/or nonidentically distributed 
sequences. 

Given a stationary time series $X_t$ let $\bar{X}_n$ be the average of the variables $X_1, \ldots, X_n$.
If $\mu$ and $\gamma_X$ are the mean and auto-covariance function of the time series, then, by the
usual rules for expectation and variance,

$$E\bar{X}_n = \mu,$$

$$(4.1) \qquad \operatorname{var}(\sqrt{n}\,\bar{X}_n) = \frac{1}{n} \sum_{s=1}^{n} \sum_{t=1}^{n} \operatorname{cov}(X_s, X_t) = \sum_{h=-n}^{n} \Bigl(\frac{n - |h|}{n}\Bigr) \gamma_X(h).$$

In the expression for the variance each of the terms $(n - |h|)/n$ is bounded by 1 and converges
to 1 as $n \to \infty$. If $\sum_h |\gamma_X(h)| < \infty$, then we can apply the dominated convergence
theorem and obtain that $\operatorname{var}(\sqrt{n}\,\bar{X}_n) \to \sum_h \gamma_X(h)$. In any case

$$(4.2) \qquad \operatorname{var} \sqrt{n}\,\bar{X}_n \le \sum_h |\gamma_X(h)|.$$

Hence absolute convergence of the series of auto-covariances implies that the sequence
$\sqrt{n}(\bar{X}_n - \mu)$ is uniformly tight. The purpose of the chapter is to give conditions for
this sequence to be asymptotically normally distributed with mean zero and variance
$\sum_h \gamma_X(h)$.
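
The right side of (4.1) is easy to evaluate numerically. The sketch below does so for the moving average $X_t = Z_t + \theta Z_{t-1}$, whose auto-covariances are $\gamma_X(0) = (1 + \theta^2)\sigma^2$, $\gamma_X(\pm 1) = \theta\sigma^2$ and zero otherwise, and compares the result with the limit $\sum_h \gamma_X(h) = (1 + \theta)^2\sigma^2$. The value $\theta = 0.5$ and the function names are assumptions made for the illustration.

```python
def gamma_ma1(h, theta, sigma2=1.0):
    """Auto-covariance of the MA(1) series X_t = Z_t + theta * Z_{t-1}."""
    if h == 0:
        return (1.0 + theta ** 2) * sigma2
    if abs(h) == 1:
        return theta * sigma2
    return 0.0

def var_scaled_mean(n, gamma):
    """Right side of (4.1); the terms h = -n and h = n have weight zero."""
    return sum((n - abs(h)) / n * gamma(h) for h in range(-n + 1, n))

theta = 0.5
limit = sum(gamma_ma1(h, theta) for h in (-1, 0, 1))   # (1 + theta)^2 = 2.25
for n in (1, 10, 100, 1000):
    print(n, var_scaled_mean(n, lambda h: gamma_ma1(h, theta)), limit)
```

For every $n$ the computed variance stays below $\sum_h |\gamma_X(h)|$, in agreement with (4.2).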

Such conditions are of two broad types. One possibility is to assume a particular type
of dependence of the variables $X_t$, such as Markov or martingale properties. Second, we
might require that the time series is "not too far" from being a sequence of independent
variables. We present three sets of sufficient conditions of the second type. These have
in common that they all require that the elements $X_t$ and $X_{t+h}$ at large time lags $h$ are
approximately independent. We start with the simplest case, that of finitely dependent
time series. Next we generalize this in two directions: to linear processes and to mixing
time series. The term "mixing" is often used in a vague sense to refer to time series
whose elements that are separated in time are approximately independent. For a central
limit theorem we then require that the time series is "sufficiently mixing" and this can
be made precise in several ways. In ergodic theory the term "mixing" is also used in a
precise sense. We briefly discuss the application to the law of large numbers.

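* 4.1 EXERCISE. Suppose that the series $v := \sum_h \gamma_X(h)$ converges (not necessarily absolutely).
Show that $\operatorname{var} \sqrt{n}\,\bar{X}_n \to v$. [Write $\operatorname{var} \sqrt{n}\,\bar{X}_n$ as $\bar{v}_n$, the average of $v_1, \ldots, v_n$, for $v_h = \sum_{|j| < h} \gamma_X(j)$ and
apply Cesàro's lemma: if $v_n \to v$, then $\bar{v}_n \to v$.]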



4.1 Finite Dependence 

A time series $X_t$ is called $m$-dependent if the random vectors $(\ldots, X_{t-1}, X_t)$ and
$(X_{t+m+1}, X_{t+m+2}, \ldots)$ are independent for every $t \in \mathbb{Z}$.

4.2 EXERCISE. Show that the moving average $X_t = Z_t + \theta Z_{t-1}$ considered in Example 1.6
is 1-dependent.

4.3 EXERCISE. Show that "0-dependent" is equivalent to "independent".

4.4 Theorem. Let $X_t$ be a strictly stationary, $m$-dependent time series with mean zero
and finite variance. Then the sequence $\sqrt{n}\,\bar{X}_n$ converges in distribution to a normal
distribution with mean 0 and variance $\sum_{h=-m}^{m} \gamma_X(h)$.

Proof. Choose a (large) integer $l$ and divide $X_1, \ldots, X_n$ into $r = \lfloor n/l \rfloor$ groups of size $l$
and a remainder group of size $n - rl < l$. Let $A_{1,l}, \ldots, A_{r,l}$ and $B_{1,l}, \ldots, B_{r,l}$ be the sums
of the first $l - m$ and last $m$ of the variables $X_i$ in the $r$ groups. Then both $A_{1,l}, \ldots, A_{r,l}$
and $B_{1,l}, \ldots, B_{r,l}$ are sequences of independent identically distributed random variables
(but the two sequences may be dependent) and

$$(4.3) \qquad \sum_{i=1}^{n} X_i = \sum_{j=1}^{r} A_{j,l} + \sum_{j=1}^{r} B_{j,l} + \sum_{i=rl+1}^{n} X_i.$$

For fixed $l$ and $n \to \infty$ (hence $r \to \infty$) the classical central limit theorem applied to the
variables $A_{j,l}$ yields

$$\frac{1}{\sqrt{n}} \sum_{j=1}^{r} A_{j,l} = \sqrt{\frac{r}{n}}\, \frac{1}{\sqrt{r}} \sum_{j=1}^{r} A_{j,l} \rightsquigarrow N\Bigl(0, \frac{1}{l} \operatorname{var} A_{1,l}\Bigr).$$

Furthermore, by the triangle inequality

$$\Bigl(E\Bigl|\frac{1}{\sqrt{n}} \sum_{i=rl+1}^{n} X_i\Bigr|^2\Bigr)^{1/2} \le \frac{l}{\sqrt{n}} \bigl(EX_1^2\bigr)^{1/2} \to 0.$$

Because the mean of the variables $n^{-1/2} \sum_{i=rl+1}^{n} X_i$ is zero, this sequence converges to
zero in probability by Chebyshev's inequality. We conclude by Slutsky's lemma that, as
$n \to \infty$,

$$S_{n,l} := \frac{1}{\sqrt{n}} \sum_{j=1}^{r} A_{j,l} + \frac{1}{\sqrt{n}} \sum_{i=rl+1}^{n} X_i \rightsquigarrow N\Bigl(0, \frac{1}{l} \operatorname{var} A_{1,l}\Bigr).$$

This is true for every fixed $l$. If $l \to \infty$, then

$$\frac{1}{l} \operatorname{var} A_{1,l} = \frac{1}{l} \operatorname{var} \sum_{i=1}^{l-m} X_i = \sum_{h=m-l}^{l-m} \frac{l - m - |h|}{l}\, \gamma_X(h) \to v := \sum_{h=-m}^{m} \gamma_X(h).$$

By Lemma 3.10 there exists a sequence $l_n \to \infty$ such that $S_{n,l_n} \rightsquigarrow N(0, v)$. Let $r_n =
\lfloor n/l_n \rfloor$ be the corresponding sequence of values of $r$, so that $r_n/n \to 0$. By the strict
stationarity of the series $X_t$ the distribution of $B_{j,l}$ is the same as the distribution of
$X_1 + \cdots + X_m$ and hence is independent of $(j, l)$. Hence $\operatorname{var} B_{j,l}$ is independent of $j$ and
$l$ and

$$E\Bigl(\frac{1}{\sqrt{n}} \sum_{j=1}^{r_n} B_{j,l_n}\Bigr)^2 = \frac{r_n}{n} \operatorname{var} B_{1,l_n} \to 0.$$

Thus the sequence of random variables in the left side converges to zero in probability,
by Chebyshev's inequality. In view of (4.3) another application of Slutsky's lemma gives
the result. ■



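[Figure: the time axis cut at the points $l+1$, $2l+1$, ..., $(r-1)l+1$, $rl+1$ into $r$ groups of length $l$; within group $j$ the first $l - m$ observations form $A_{j,l}$ and the last $m$ observations form $B_{j,l}$.]

Figure 4.1. Blocking of observations in the proof of Theorem 4.4.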
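
The blocking used in the proof of Theorem 4.4 can be made concrete by listing the indices involved. The helper `blocks` below is a hypothetical illustration, not part of the notes; it returns the index sets over which the sums $A_{j,l}$ and $B_{j,l}$ and the remainder sum are taken.

```python
def blocks(n, l, m):
    """Split indices 1..n as in the proof: r groups of size l, each divided
    into a first part of l - m indices (A) and a last part of m indices (B),
    plus a remainder group of size n - r*l < l."""
    r = n // l
    a_blocks, b_blocks = [], []
    for j in range(r):
        start = j * l + 1
        a_blocks.append(list(range(start, start + l - m)))
        b_blocks.append(list(range(start + l - m, start + l)))
    remainder = list(range(r * l + 1, n + 1))
    return a_blocks, b_blocks, remainder

a_blocks, b_blocks, remainder = blocks(n=23, l=5, m=2)
print(a_blocks)    # [[1, 2, 3], [6, 7, 8], [11, 12, 13], [16, 17, 18]]
print(b_blocks)    # [[4, 5], [9, 10], [14, 15], [19, 20]]
print(remainder)   # [21, 22, 23]
```

Consecutive A-blocks are separated by the $m$ indices of a B-block, which is what makes the sums $A_{1,l}, \ldots, A_{r,l}$ independent for an $m$-dependent series.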



4.2 Linear Processes 

In this section we extend the central limit theorem from finitely dependent time series
to linear processes. These are processes that can be represented as infinite moving averages.
Given a sequence $\ldots, Z_{-1}, Z_0, Z_1, Z_2, \ldots$ of independent and identically distributed
variables with $EZ_t = 0$, a constant $\mu$, and constants $\psi_j$ with $\sum_j |\psi_j| < \infty$, we assume
that

$$(4.4) \qquad X_t = \mu + \sum_{j=-\infty}^{\infty} \psi_j Z_{t-j}.$$

This may seem special, but we shall see later that this includes, for instance, the rich
class of all ARMA-processes.

By (iii) of Lemma 1.27 the covariance function of a linear process is given by $\gamma_X(h) =
\sigma^2 \sum_j \psi_j \psi_{j+h}$, where $\sigma^2 = \operatorname{var} Z_t$, and hence the asymptotic variance of $\sqrt{n}\,\bar{X}_n$ is given
by

$$v := \sum_h \gamma_X(h) = \sigma^2 \sum_h \sum_j \psi_j \psi_{j+h} = \sigma^2 \Bigl(\sum_j \psi_j\Bigr)^2.$$

4.5 Theorem. Suppose that (4.4) holds for an i.i.d. sequence $Z_t$ with mean zero and
finite variance and numbers $\psi_j$ with $\sum_j |\psi_j| < \infty$. Then the sequence $\sqrt{n}(\bar{X}_n - \mu)$
converges in distribution to a normal distribution with mean zero and variance $v = \sigma^2 (\sum_j \psi_j)^2$.

Proof. We can assume without loss of generality that $\mu = 0$. For a fixed (large) integer
$m$ define the time series

$$X_t^m = \sum_{|j| \le m} \psi_j Z_{t-j} = \sum_j \psi_j^m Z_{t-j},$$

where $\psi_j^m = \psi_j$ if $|j| \le m$ and 0 otherwise. Then $X_t^m$ is $(2m + 1)$-dependent and strictly
stationary. By Theorem 4.4, the sequence $\sqrt{n}\,\bar{X}_n^m$ converges in distribution to a normal
distribution with mean zero and variance

$$v_m := \sum_h \gamma_{X^m}(h) = \sigma^2 \sum_h \sum_j \psi_j^m \psi_{j+h}^m = \sigma^2 \Bigl(\sum_{|j| \le m} \psi_j\Bigr)^2.$$

The first equality follows from (iii) of Lemma 1.27. As $m \to \infty$ this variance converges
to $v$. Because $N(0, v_m) \rightsquigarrow N(0, v)$, Lemma 3.10 guarantees that there exists a sequence
$m_n \to \infty$ such that $\sqrt{n}\,\bar{X}_n^{m_n} \rightsquigarrow N(0, v)$.

In view of Slutsky's lemma the proof will be complete once we have shown that
$\sqrt{n}(\bar{X}_n - \bar{X}_n^{m_n}) \overset{P}{\to} 0$. This concerns the average $\bar{Y}_n^{m_n}$ of the differences $Y_t^m = X_t - X_t^m =
\sum_{|j| > m} \psi_j Z_{t-j}$. These satisfy

$$E\bigl(\sqrt{n}\,\bar{X}_n - \sqrt{n}\,\bar{X}_n^{m_n}\bigr)^2 = \operatorname{var} \sqrt{n}\,\bar{Y}_n^{m_n} \le \sum_h \bigl|\gamma_{Y^{m_n}}(h)\bigr| \le \sigma^2 \Bigl(\sum_{|j| > m_n} |\psi_j|\Bigr)^2.$$

The inequalities follow by (4.1) and Lemma 1.27(iii). The right side converges to zero as
$m_n \to \infty$. ■
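
The convergence $v_m \to v$ used in the proof can be seen numerically. The sketch below assumes, purely for illustration, the one-sided coefficients $\psi_j = \phi^j$ for $j \ge 0$ (and $\psi_j = 0$ for $j < 0$) with $\phi = 0.5$ and $\sigma^2 = 1$, so that $v = \sigma^2/(1 - \phi)^2 = 4$; the function names are hypothetical.

```python
def truncated_variance(psi, m, sigma2=1.0):
    """v_m = sigma^2 * (sum of psi_j over |j| <= m)^2."""
    total = sum(psi(j) for j in range(-m, m + 1))
    return sigma2 * total ** 2

def psi_geometric(j, phi=0.5):
    """Hypothetical coefficients psi_j = phi^j for j >= 0 and 0 for j < 0."""
    return phi ** j if j >= 0 else 0.0

v = 1.0 / (1.0 - 0.5) ** 2          # sigma^2 * (sum_j psi_j)^2 = 4
for m in (1, 5, 10, 20):
    print(m, truncated_variance(psi_geometric, m), v)
```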



4.3 Strong Mixing 

The $\alpha$-mixing coefficients (or strong mixing coefficients) of a strictly stationary time
series $X_t$ are defined by $\alpha(0) = \tfrac{1}{2}$ and for $h \in \mathbb{N}$*

$$\alpha(h) = 2 \sup_{\substack{A \in \sigma(\ldots, X_{-1}, X_0) \\ B \in \sigma(X_h, X_{h+1}, \ldots)}} \bigl|P(A \cap B) - P(A)P(B)\bigr|.$$

* We denote by $\sigma(X_t: t \in I)$ the $\sigma$-field generated by the set of random variables $\{X_t: t \in I\}$.

By the strict stationarity of $X_t$ the $\sigma$-fields $\sigma(\ldots, X_{-1}, X_0)$ and $\sigma(X_h, X_{h+1}, \ldots)$ in this
definition could be replaced by the $\sigma$-fields $\sigma(\ldots, X_{t-1}, X_t)$ and $\sigma(X_{t+h}, X_{t+h+1}, \ldots)$ for
any $t \in \mathbb{Z}$, without changing the value of $\alpha(h)$. The point is that the $\sigma$-fields are separated
by $h$ time instants. The $\alpha$-mixing coefficients measure the extent by which events $A$ and
$B$ that are separated by $h$ time instants fail to satisfy the equality $P(A \cap B) = P(A)P(B)$,
which is valid for independent events.

It is immediate from their definition that the coefficients $\alpha(1), \alpha(2), \ldots$ are decreasing
and nonnegative. Furthermore, if the time series is $m$-dependent, then $\alpha(h) = 0$ for
$h > m$.

4.6 EXERCISE. Show that $\alpha(1) \le \tfrac{1}{2} = \alpha(0)$. [Apply the inequality of Cauchy-Schwarz
to $P(A \cap B) - P(A)P(B) = \operatorname{cov}(1_A, 1_B)$.]

If $\alpha(h) \to 0$ as $h \to \infty$, then the time series $X_t$ is called $\alpha$-mixing or strong mixing.
Then events connected to time sets that are far apart are "approximately independent".
For a central limit theorem to hold, we also need that the convergence to 0 takes place
at a sufficient speed, dependent on the "sizes" of the variables $X_t$.

A precise formulation can best be given in terms of the inverse function of the mixing
coefficients. We can extend $\alpha$ to a function $\alpha: [0, \infty) \to [0, 1]$ by defining it to be constant
on the intervals $[h, h + 1)$ for integers $h$. This yields a monotone function that decreases
in steps from $\alpha(0) = \tfrac{1}{2}$ to 0 at infinity in the case that the time series is mixing. The
generalized inverse $\alpha^{-1}: [0, 1] \to [0, \infty)$ is defined by

$$\alpha^{-1}(u) = \inf\{x \ge 0: \alpha(x) \le u\} = \sum_{h=0}^{\infty} 1_{u < \alpha(h)}.$$
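
The second expression for the generalized inverse is simply a count of the coefficients that exceed $u$. A minimal sketch follows; the numerical coefficients are hypothetical and chosen to mimic a 2-dependent series, for which $\alpha(h) = 0$ when $h > 2$.

```python
def alpha_inverse(u, alpha):
    """Generalized inverse alpha^{-1}(u) = number of h >= 0 with u < alpha(h).

    `alpha` is a finite list [alpha(0), alpha(1), ...] of decreasing
    coefficients; coefficients beyond the list are taken to be zero
    (an assumption that holds for an m-dependent series).
    """
    return sum(1 for a in alpha if u < a)

# Hypothetical coefficients of a 2-dependent series: alpha(h) = 0 for h > 2.
alpha = [0.5, 0.2, 0.1]
print([alpha_inverse(u, alpha) for u in (0.05, 0.1, 0.15, 0.3, 0.5)])
```

For an $m$-dependent series the count never exceeds $m + 1$, which is the bound needed in Exercise 4.9.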

4.7 Theorem. If $X_t$ is a strictly stationary time series with mean zero such that
$\int_0^1 \alpha^{-1}(u) F_{|X_0|}^{-1}(1 - u)^2\, du < \infty$, then the series $v = \sum_h \gamma_X(h)$ converges absolutely
and $\sqrt{n}\,\bar{X}_n \rightsquigarrow N(0, v)$.

Here $F_X^{-1}$ denotes the quantile function of a given random variable $X$, so that

$$F_X^{-1}(1 - u) = \inf\{x: 1 - F_X(x) \le u\},$$

for $F_X$ the distribution function of $X$.

At first sight the condition of the theorem is complicated. Finiteness of the integral
requires that the mixing coefficients converge to zero fast enough, but the rate at which
this must happen is also dependent on the tails of the variables $X_t$. To make this concrete
we can derive finiteness of the integral under a combination of a mixing and a moment
condition. If $c_r := E|X_0|^r < \infty$ for some $r > 2$, then $1 - F_{|X_0|}(x) \le c_r/x^r$ by Markov's
inequality and hence $F_{|X_0|}^{-1}(1 - u) \le c/u^{1/r}$ for $c = c_r^{1/r}$. Then we obtain the bound

$$\int_0^1 \alpha^{-1}(u) F_{|X_0|}^{-1}(1 - u)^2\, du \le \sum_{h=0}^{\infty} \int_0^1 1_{u < \alpha(h)} \frac{c^2}{u^{2/r}}\, du = \frac{c^2 r}{r - 2} \sum_{h=0}^{\infty} \alpha(h)^{1 - 2/r}.$$

Thus the moment condition $E|X_t|^r < \infty$ and the mixing condition $\sum_{h=0}^{\infty} \alpha(h)^{1 - 2/r} < \infty$
together are sufficient for the central limit theorem. This shows a trade-off between
moments and mixing: for larger values of $r$ the moment condition is more restrictive, but
the mixing condition is weaker.
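
The trade-off can be quantified for a specific mixing rate. The following sketch assumes a polynomial rate $\alpha(h) = \tfrac{1}{2}(h + 1)^{-a}$, which is a hypothetical choice made here and not one from the notes. For this rate the series $\sum_h \alpha(h)^{1 - 2/r}$ converges exactly when $a(1 - 2/r) > 1$, that is when $r > 2a/(a - 1)$.

```python
def mixing_partial_sum(a, r, terms):
    """Partial sum of alpha(h)^(1 - 2/r) for the hypothetical polynomial
    rate alpha(h) = 0.5 * (h + 1)^(-a)."""
    exponent = 1.0 - 2.0 / r
    return sum((0.5 * (h + 1) ** (-a)) ** exponent for h in range(terms))

def minimal_moment_order(a):
    """The series converges exactly when a * (1 - 2/r) > 1, i.e. r > 2a/(a - 1)."""
    if a <= 1:
        raise ValueError("no finite r works when a <= 1")
    return 2.0 * a / (a - 1.0)

print(minimal_moment_order(3.0))   # 3.0: more than three moments are needed
# For a = 3 and r = 4 the partial sums stabilise as the number of terms grows.
print(mixing_partial_sum(3.0, 4.0, 10**3), mixing_partial_sum(3.0, 4.0, 10**5))
```

A faster mixing rate (larger $a$) lowers the required moment order towards 2, while a slow rate with $a$ close to 1 requires moments of very high order.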

4.8 EXERCISE (Case $r = \infty$). Show that $\int_0^1 \alpha^{-1}(u) F_{|X_0|}^{-1}(1 - u)^2\, du$ is bounded above
by $\|X_0\|_\infty^2 \sum_{h=0}^{\infty} \alpha(h)$. [Note that $F_{|X_0|}^{-1}(1 - U)$ is distributed as $|X_0|$ if $U$ is uniformly
distributed and hence is bounded by $\|X_0\|_\infty$ almost surely.]

4.9 EXERCISE. Show that $\int_0^1 \alpha^{-1}(u) F_{|X_0|}^{-1}(1 - u)^2\, du \le (m + 1) EX_0^2$ if the time series
$X_t$ is $m$-dependent. Recover Theorem 4.4 from Theorem 4.7.

4.10 EXERCISE. Show that $\int_0^1 \alpha^{-1}(u) F_{|X_0|}^{-1}(1 - u)^2\, du < \infty$ implies that $E|X_0|^2 < \infty$.

For the proof of Theorem 4.7 we need a number of lemmas. Let $\|X\|_p$ denote the
$L_p$-norm of a random variable $X$, i.e.

$$\|X\|_p = (E|X|^p)^{1/p}, \quad 1 \le p < \infty, \qquad \|X\|_\infty = \inf\{M: P(|X| \le M) = 1\}.$$

Recall Hölder's inequality: for any pair of numbers $p, q > 0$ (possibly infinite) with
$p^{-1} + q^{-1} = 1$ and random variables $X$ and $Y$

$$E|XY| \le \|X\|_p \|Y\|_q.$$

For $p = q = 2$ this is precisely the inequality of Cauchy-Schwarz. The other case that will
be of interest to us is the case $p = 1$, $q = \infty$, for which the inequality is easy to prove. By
repeated application the inequality can be extended to more than two random variables.
For instance, for any numbers $p, q, r > 0$ with $p^{-1} + q^{-1} + r^{-1} = 1$ and random variables
$X$, $Y$, and $Z$

$$E|XYZ| \le \|X\|_p \|Y\|_q \|Z\|_r.$$

4.11 Lemma (Covariance bound). Let $X_t$ be a strictly stationary time series with
$\alpha$-mixing coefficients $\alpha(h)$ and let $Y$ and $Z$ be random variables that are measurable
relative to $\sigma(\ldots, X_{-1}, X_0)$ and $\sigma(X_h, X_{h+1}, \ldots)$, respectively, for a given $h \ge 0$. Then,
for any $p, q, r > 0$ such that $p^{-1} + q^{-1} + r^{-1} = 1$,

$$|\mathrm{cov}(Y, Z)| \le 2\int_0^{\alpha(h)} F^{-1}_{|Y|}(1-u)\,F^{-1}_{|Z|}(1-u)\,du \le 2\alpha(h)^{1/p}\,\|Y\|_q\,\|Z\|_r.$$

Proof. By the definition of the mixing coefficients, we have, for every $y, z > 0$,

$$|\mathrm{cov}(1_{Y^+>y}, 1_{Z^+>z})| \le \tfrac12\alpha(h).$$

The same inequality is valid with $Y^+$ and/or $Z^+$ replaced by $Y^-$ and/or $Z^-$. It follows
that

$$|\mathrm{cov}(1_{Y^+>y} - 1_{Y^->y},\ 1_{Z^+>z} - 1_{Z^->z})| \le 2\alpha(h).$$

Because $|\mathrm{cov}(U, V)| \le 2(E|U|)\|V\|_\infty$ for any pair of random variables $U, V$ (the simplest
Hölder inequality), we obtain that the covariance on the left side of the preceding display
is also bounded by $2\bigl(P(Y^+ > y) + P(Y^- > y)\bigr)$. Yet another bound for the covariance
is obtained by interchanging the roles of $Y$ and $Z$. Combining the three inequalities, we
see that, for any $y, z > 0$,

$$|\mathrm{cov}(1_{Y^+>y} - 1_{Y^->y},\ 1_{Z^+>z} - 1_{Z^->z})| \le 2\alpha(h) \wedge 2P(|Y| > y) \wedge 2P(|Z| > z) = 2\int_0^{\alpha(h)} 1_{1-F_{|Y|}(y)>u}\,1_{1-F_{|Z|}(z)>u}\,du.$$

Next we write $Y = Y^+ - Y^- = \int_0^\infty (1_{Y^+>y} - 1_{Y^->y})\,dy$ and similarly for $Z$, to obtain,
by Fubini's theorem,

$$|\mathrm{cov}(Y, Z)| = \Bigl|\int_0^\infty\!\!\int_0^\infty \mathrm{cov}(1_{Y^+>y} - 1_{Y^->y},\ 1_{Z^+>z} - 1_{Z^->z})\,dy\,dz\Bigr| \le 2\int_0^\infty\!\!\int_0^\infty\!\!\int_0^{\alpha(h)} 1_{F_{|Y|}(y)<1-u}\,1_{F_{|Z|}(z)<1-u}\,du\,dy\,dz.$$

Any pair of a distribution and a quantile function satisfies $F_X(x) < u$ if and only if
$x < F_X^{-1}(u)$, for every $x$ and $u$. We can conclude the proof of the first inequality of the lemma
by another application of Fubini's theorem.

The second inequality follows upon noting that $F^{-1}_{|Y|}(1-U)$ is distributed as $|Y|$ if
$U$ is uniformly distributed on $[0, 1]$, and next applying Hölder's inequality. ■

4.12 Lemma. Let $X_n$ be a sequence of random variables such that $E|X_n|^2 = O(1)$ and
such that $E(iX_n + v\lambda)e^{i\lambda X_n} \to 0$ as $n \to \infty$, for every $\lambda \in \mathbb{R}$ and some $v > 0$. Then
$X_n \rightsquigarrow N(0, v)$.

Proof. By Markov's inequality and the bound on the second moments, the sequence $X_n$
is uniformly tight. In view of Prohorov's theorem it suffices to show that $N(0, v)$ is the
only weak limit point.

If $X_n \rightsquigarrow X$ along some sequence of $n$, then by the boundedness of the second moments
and the continuity of the function $x \mapsto (ix + v\lambda)e^{i\lambda x}$, we have
$E(iX_n + v\lambda)e^{i\lambda X_n} \to E(iX + v\lambda)e^{i\lambda X}$ for every $\lambda \in \mathbb{R}$. (Cf. Theorem 3.8.) Combining this
with the assumption, we see that $E(iX + v\lambda)e^{i\lambda X} = 0$. By Fatou's lemma
$EX^2 \le \liminf EX_n^2 < \infty$ and hence we can differentiate the characteristic function
$\phi(\lambda) = Ee^{i\lambda X}$ under the expectation to find that $\phi'(\lambda) = E\,iXe^{i\lambda X}$. We conclude that
$\phi'(\lambda) = -v\lambda\phi(\lambda)$. This differential equation possesses $\phi(\lambda) = e^{-v\lambda^2/2}$ as the only
solution within the class of characteristic functions. Thus $X$ is normally distributed with
mean zero and variance $v$. ■

Proof of Theorem 4.7. As a consequence of Lemma 4.11 we find that

$$\sum_{h\ge0} |\gamma_X(h)| \le 2\sum_{h\ge0}\int_0^{\alpha(h)} F^{-1}_{|X_0|}(1-u)^2\,du = 2\int_0^1 \alpha^{-1}(u)\,F^{-1}_{|X_0|}(1-u)^2\,du.$$

This already proves the first assertion of Theorem 4.7. Furthermore, in view of (4.2) and
the symmetry of the auto-covariance function,

$$\mathrm{var}\,\sqrt n\,\overline{X}_n \le 4\int_0^1 \alpha^{-1}(u)\,F^{-1}_{|X_0|}(1-u)^2\,du. \tag{4.5}$$

For a given $M > 0$ let $X_t^M = X_t 1\{|X_t| \le M\}$ and let $Y_t^M = X_t - X_t^M$. Because $X_t^M$
is a measurable transformation of $X_t$, it is immediate from the definition of the mixing
coefficients that the series $Y_t^M$ is mixing with smaller mixing coefficients than the series
$X_t$. Therefore, in view of (4.5),

$$\mathrm{var}\,\sqrt n\,(\overline{X}_n - \overline{X}_n^M) = \mathrm{var}\,\sqrt n\,\overline{Y}_n^M \le 4\int_0^1 \alpha^{-1}(u)\,F^{-1}_{|Y_0^M|}(1-u)^2\,du.$$

Because $Y_0^M = 0$ whenever $|X_0| \le M$, it follows that $Y_0^M \rightsquigarrow 0$ as $M \to \infty$ and hence
$F^{-1}_{|Y_0^M|}(u) \to 0$ for every $u \in (0, 1)$. Furthermore, because $|Y_0^M| \le |X_0|$, its quantile
function is bounded above by the quantile function of $|X_0|$. By the dominated convergence
theorem the integral in the preceding display converges to zero as $M \to \infty$, and hence
the variance in the left side converges to zero as $M \to \infty$, uniformly in $n$. If we can show
that $\sqrt n(\overline{X}_n^M - EX_0^M) \rightsquigarrow N(0, v^M)$ as $n \to \infty$ for $v^M = \lim \mathrm{var}\,\sqrt n\,\overline{X}_n^M$ and every fixed
$M$, then it follows that $\sqrt n(\overline{X}_n - EX_0) \rightsquigarrow N(0, v)$ for $v = \lim v^M = \lim \mathrm{var}\,\sqrt n\,\overline{X}_n$, by
Lemma 3.10, and the proof is complete.

Thus it suffices to prove the theorem for uniformly bounded variables $X_t$. Let $M$ be
the uniform bound.

Fix some sequence $m_n \to \infty$ such that $\sqrt n\,\alpha(m_n) \to 0$ and $m_n/\sqrt n \to 0$. Such
a sequence exists. To see this, first note that $\sqrt n\,\alpha(\lfloor \sqrt n/k \rfloor) \to 0$ as $n \to \infty$, for every
fixed $k$. (See Problem 4.13.) Thus by Lemma 3.10 there exists $k_n \to \infty$ such that
$\sqrt n\,\alpha(\lfloor \sqrt n/k_n \rfloor) \to 0$ as $n \to \infty$. Now set $m_n = \lfloor \sqrt n/k_n \rfloor$. For simplicity write $m$ for
$m_n$. Also let it be silently understood that all summation indices are restricted to the
integers $1, 2, \ldots, n$, unless indicated otherwise.

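Let $S_n = n^{-1/2}\sum_{t=1}^n X_t$ and, for every given $t$, set $S_n(t) = n^{-1/2}\sum_{|j-t|<m} X_j$.
Because $|e^{i\lambda} - 1 - i\lambda| \le \tfrac12\lambda^2$ for every $\lambda \in \mathbb{R}$, we have

$$\begin{aligned}
\Bigl|E\Bigl(\frac{1}{\sqrt n}\sum_{t=1}^n X_t e^{i\lambda S_n}\bigl(e^{-i\lambda S_n(t)} - 1 + i\lambda S_n(t)\bigr)\Bigr)\Bigr|
&\le \frac{\lambda^2 M}{2\sqrt n}\sum_{t=1}^n ES_n^2(t) \\
&= \frac{\lambda^2 M}{2\sqrt n}\sum_{t=1}^n \frac1n \sum_{|i-t|<m}\sum_{|j-t|<m} \gamma_X(i-j) \\
&\le \frac{\lambda^2 M}{\sqrt n}\,m\sum_h |\gamma_X(h)| \to 0.
\end{aligned}$$

Furthermore, with $A_n(t)$ and $B_n(t)$ defined as $n^{-1/2}$ times the sum of the $X_j$ with
$1 \le j \le t - m$ and $t + m \le j \le n$, respectively, we have $S_n - S_n(t) = A_n(t) + B_n(t)$ and

$$\begin{aligned}
\Bigl|E\Bigl(\frac{1}{\sqrt n}\sum_{t=1}^n X_t e^{i\lambda S_n} e^{-i\lambda S_n(t)}\Bigr)\Bigr|
&\le \frac{1}{\sqrt n}\sum_{t=1}^n \Bigl(\bigl|\mathrm{cov}\bigl(X_t e^{i\lambda A_n(t)}, e^{i\lambda B_n(t)}\bigr)\bigr| + \bigl|\mathrm{cov}\bigl(X_t, e^{i\lambda A_n(t)}\bigr)\bigr|\,\bigl|Ee^{i\lambda B_n(t)}\bigr|\Bigr) \\
&\le 4\sqrt n\,M\alpha(m) \to 0,
\end{aligned}$$

by the second inequality of Lemma 4.11, with $p = 1$ and $q = r = \infty$. Combining the
preceding pair of displays we see that

$$ES_n e^{i\lambda S_n} = E\,\frac{1}{\sqrt n}\sum_{t=1}^n X_t e^{i\lambda S_n}\,i\lambda S_n(t) + o(1) = i\lambda\,E\Bigl(e^{i\lambda S_n}\,\frac1n \mathop{\sum\sum}_{|s-t|<m} X_s X_t\Bigr) + o(1).$$

If we can show that $n^{-1}\sum\sum_{|s-t|<m} X_s X_t$ converges in mean to $v$, then the right side
of the last display is asymptotically equivalent to $i\lambda\,Ee^{i\lambda S_n}v$, and the theorem is proved
in view of Lemma 4.12.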

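In fact, we show that $n^{-1}\sum\sum_{|s-t|<m} X_s X_t \to v$ in second mean. First,

$$E\,\frac1n \mathop{\sum\sum}_{|s-t|<m} X_s X_t = \sum_{|h|<m} \frac{n - |h|}{n}\,\gamma_X(h) \to v,$$

by the dominated convergence theorem, in view of (4.2). Second,

$$\mathrm{var}\Bigl(\frac1n \mathop{\sum\sum}_{|s-t|<m} X_s X_t\Bigr) \le \frac{1}{n^2} \mathop{\sum\sum}_{|s-t|<m}\ \mathop{\sum\sum}_{|i-j|<m} \bigl|\mathrm{cov}(X_s X_t, X_i X_j)\bigr|.$$

The first double sum on the right can be split in the sums over the pairs $(s, t)$ with $s < t$
and $s \ge t$, respectively, and similarly for the second double sum relative to $(i, j)$. By
symmetry the right side is bounded by

$$\begin{aligned}
\frac{4}{n^2} \mathop{\sum\sum}_{|s-t|<m,\ s\le t}\ \mathop{\sum\sum}_{|i-j|<m,\ i\le j} \bigl|\mathrm{cov}(X_s X_t, X_i X_j)\bigr|
&\le \frac{4}{n^2}\sum_{s=1}^n\sum_{t=0}^m\sum_{i=1}^n\sum_{j=0}^m \bigl|\mathrm{cov}(X_s X_{s+t}, X_i X_{i+j})\bigr| \\
&\le \frac{8}{n^2}\sum_{s=1}^n\sum_{t=0}^m\sum_{i=1}^n\sum_{j=0}^m \bigl|\mathrm{cov}(X_s X_{s+t}, X_{s+i} X_{s+i+j})\bigr|,
\end{aligned}$$

by the same argument, this time splitting the sums over $s \le i$ and $s > i$ and using
symmetry between $s$ and $i$. If $i \ge t$, then the covariance in the sum is bounded above by
$2\alpha(i - t)M^4$, by Lemma 4.11, because there are $i - t$ time instants between $X_s X_{s+t}$ and
$X_{s+i} X_{s+i+j}$. If $i < t$, then we rewrite the absolute covariance as

$$\bigl|\mathrm{cov}(X_s, X_{s+t} X_{s+i} X_{s+i+j}) - \mathrm{cov}(X_s, X_{s+t})\,EX_{s+i} X_{s+i+j}\bigr| \le 4\alpha(i)M^4.$$

Thus the four-fold sum is bounded above by

$$\frac{32}{n^2}\sum_{s=1}^n\sum_{t=0}^m\sum_{i=1}^n\sum_{j=0}^m \bigl(\alpha(i - t)M^4 1_{i\ge t} + \alpha(i)M^4 1_{i<t}\bigr) \le 64M^4\,\frac{m^2}{n}\sum_{i\ge0}\alpha(i).$$

Because $u \mapsto F^{-1}_{|X_0|}(1-u)$ is bounded away from zero in a neighbourhood of 0, finiteness of the
integral $\int_0^1 \alpha^{-1}(u)\,F^{-1}_{|X_0|}(1-u)^2\,du$ implies that the series on the right converges. This
concludes the proof. ■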

* 4.13 EXERCISE. Suppose that $\alpha(h)$ is a decreasing sequence of nonnegative numbers
($h = 1, 2, \ldots$) with $\sum_h \alpha(h) < \infty$. Show that $h\alpha(h) \to 0$ as $h \to \infty$. [First derive, using
the monotonicity, that $\sum_h 2^h\alpha(2^h) < \infty$ and conclude from this that $2^h\alpha(2^h) \to 0$. Next
use the monotonicity again "to fill the gaps".]



* 4.4 Uniform Mixing 

There are several other types of mixing coefficients. The $\phi$-mixing coefficients or uniform
mixing coefficients of a strictly stationary time series $X_t$ are defined by

$$\phi(h) = \sup_{\substack{A \in \sigma(\ldots, X_{-1}, X_0),\ P(A) \ne 0 \\ B \in \sigma(X_h, X_{h+1}, \ldots)}} |P(B|A) - P(B)|,$$

$$\tilde\phi(h) = \sup_{\substack{A \in \sigma(\ldots, X_{-1}, X_0) \\ B \in \sigma(X_h, X_{h+1}, \ldots),\ P(B) \ne 0}} |P(A|B) - P(A)|.$$

It is immediate from the definitions that $\alpha(h) \le 2\bigl(\phi(h) \wedge \tilde\phi(h)\bigr)$. Thus a $\phi$-mixing time
series is always $\alpha$-mixing. It appears that conditions in terms of $\phi$-mixing are often much
more restrictive, even though there is no complete overlap.

4.14 Lemma (Covariance bound). Let $X_t$ be a strictly stationary time series with
$\phi$-mixing coefficients $\phi(h)$ and $\tilde\phi(h)$ and let $Y$ and $Z$ be random variables that are
measurable relative to $\sigma(\ldots, X_{-1}, X_0)$ and $\sigma(X_h, X_{h+1}, \ldots)$, respectively, for a given
$h \ge 0$. Then, for any $p, q > 0$ with $p^{-1} + q^{-1} = 1$,

$$|\mathrm{cov}(Y, Z)| \le 2\phi(h)^{1/p}\,\tilde\phi(h)^{1/q}\,\|Y\|_p\,\|Z\|_q.$$

Proof. Let $Q$ be the measure $P^{Y,Z} - P^Y \otimes P^Z$ on $\mathbb{R}^2$, and let $|Q|$ be its absolute value.
Then

$$|\mathrm{cov}(Y, Z)| = \Bigl|\iint yz\,dQ(y, z)\Bigr| \le \Bigl(\iint |y|^p\,d|Q|(y, z)\Bigr)^{1/p}\Bigl(\iint |z|^q\,d|Q|(y, z)\Bigr)^{1/q},$$

by Hölder's inequality. It suffices to show that the first and second marginals of $|Q|$ are
bounded above by the measures $2\phi(h)P^Y$ and $2\tilde\phi(h)P^Z$, respectively. By symmetry it
suffices to consider the first marginal.
By definition we have that

$$|Q|(C) = \sup_D \bigl(|Q(C \cap D)| + |Q(C \cap D^c)|\bigr)$$

for the supremum taken over all Borel sets $D$ in $\mathbb{R}^2$. Equivalently, we can compute the
supremum over any algebra that generates the Borel sets. In particular, we can use the
algebra consisting of all finite unions of rectangles $A \times B$. Conclude from this that

$$|Q|(C) = \sup \sum_i\sum_j \bigl|Q\bigl(C \cap (A_i \times B_j)\bigr)\bigr|,$$

for the supremum taken over all pairs of partitions $\mathbb{R} = \cup_i A_i$ and $\mathbb{R} = \cup_j B_j$. It follows
that

$$\begin{aligned}
|Q|(A \times \mathbb{R}) &= \sup \sum_i\sum_j \bigl|Q\bigl((A \cap A_i) \times B_j\bigr)\bigr| \\
&= \sup \sum_i\sum_j \bigl|P^{Z|Y}(B_j \,|\, A \cap A_i) - P^Z(B_j)\bigr|\,P^Y(A \cap A_i).
\end{aligned}$$

If, for fixed $i$, $B_i^+$ consists of the union of all $B_j$ such that $P^{Z|Y}(B_j \,|\, A \cap A_i) - P^Z(B_j) > 0$
and $B_i^-$ is the union of the remaining $B_j$, then the double sum can be rewritten

$$\sum_i \Bigl(\bigl|P^{Z|Y}(B_i^+ \,|\, A \cap A_i) - P^Z(B_i^+)\bigr| + \bigl|P^{Z|Y}(B_i^- \,|\, A \cap A_i) - P^Z(B_i^-)\bigr|\Bigr)P^Y(A \cap A_i).$$

The sum between round brackets is bounded above by $2\phi(h)$, by the definition of $\phi$. Thus
the display is bounded above by $2\phi(h)P^Y(A)$. ■

4.15 Theorem. If $X_t$ is a strictly stationary time series with mean zero such that
$E|X_t|^{p \vee q} < \infty$ and $\sum_h \phi(h)^{1/p}\,\tilde\phi(h)^{1/q} < \infty$ for some $p, q > 0$ with $p^{-1} + q^{-1} = 1$,
then the series $v = \sum_h \gamma_X(h)$ converges absolutely and $\sqrt n\,\overline{X}_n \rightsquigarrow N(0, v)$.

Proof. For a given $M > 0$ let $X_t^M = X_t 1\{|X_t| \le M\}$ and let $Y_t^M = X_t - X_t^M$.
Because $X_t^M$ is a measurable transformation of $X_t$, it is immediate from the definition
of the mixing coefficients that $Y_t^M$ is mixing with smaller mixing coefficients than $X_t$.
Therefore, by (4.2) and Lemma 4.14,

$$\mathrm{var}\,\sqrt n\,(\overline{X}_n - \overline{X}_n^M) \le 2\sum_h \phi(h)^{1/p}\,\tilde\phi(h)^{1/q}\,\|Y_0^M\|_p\,\|Y_0^M\|_q.$$

As $M \to \infty$, the right side converges to zero, and hence the left side converges to zero,
uniformly in $n$. This means that we can reduce the problem to the case of uniformly
bounded time series $X_t$, as in the proof of Theorem 4.7.

Because the $\alpha$-mixing coefficients are bounded above by the $\phi$-mixing coefficients,
we have that $\sum_h \alpha(h) < \infty$. Therefore, the second part of the proof of Theorem 4.7
applies without changes. ■



4.5 Ergodic Theorem 

The law of large numbers is concerned with the convergence of the sequence $\overline{X}_n$ rather
than the sequence $\sqrt n(\overline{X}_n - \mu)$. By Slutsky's lemma $\overline{X}_n \to \mu$ in probability if the
sequence $\sqrt n(\overline{X}_n - \mu)$ is uniformly tight. Thus a central limit theorem implies a weak
law of large numbers. However, the latter is valid under much weaker conditions. The
weakening not only concerns moments, but also the dependence between the $X_t$.

The strong law of large numbers for a strictly stationary time series belongs to
ergodic theory. In this section we discuss the main facts and some examples. For a weak
law for stationary time series also see Example 6.30.

Given a strictly stationary sequence $X_t$ defined on some probability space $(\Omega, \mathcal{U}, P)$,
with values in some measurable space $(\mathcal{X}, \mathcal{A})$, the invariant $\sigma$-field, denoted $\mathcal{U}_{inv}$, is the
$\sigma$-field consisting of all sets $A$ such that $A = (\ldots, X_{t-1}, X_t, X_{t+1}, \ldots)^{-1}(B)$ for all $t$ and
some measurable set $B \subset \mathcal{X}^\infty$. Here throughout this section the product space $\mathcal{X}^\infty$ is
equipped with the product $\sigma$-field $\mathcal{A}^\infty$.

Our notation in the definition of the invariant $\sigma$-field is awkward, if not unclear,
because we look at two-sided infinite series. The triple $X_{t-1}, X_t, X_{t+1}$ in the definition of
$A$ is meant to be centered at a fixed position in $\mathbb{Z}$. We can write this down more precisely
using the forward shift function $\phi: \mathcal{X}^\infty \to \mathcal{X}^\infty$ defined by $\phi(x)_i = x_{i+1}$. The two-sided
sequence $(\ldots, X_{t-1}, X_t, X_{t+1}, \ldots)$ defines a map $X: \Omega \to \mathcal{X}^\infty$. With this notation the
invariant sets $A$ are the sets such that $A = \{\phi^t X \in B\}$ for all $t$ and some measurable
set $B \subset \mathcal{X}^\infty$. The strict stationarity of the sequence $X$ is identical to the invariance of
its induced law $P^X$ on $\mathcal{X}^\infty$ under the shift $\phi$.

The inverse images $X^{-1}(B)$ of measurable sets $B \subset \mathcal{X}^\infty$ with $B = \phi B$ are clearly
invariant. Conversely, it can be shown that, up to null sets, all invariant sets take this form.
(See Exercise 4.17.) The symmetric events are special examples of invariant sets. They
are the events that depend symmetrically on the variables $X_t$. For instance, $\cap_t X_t^{-1}(B)$
for some measurable set $B \subset \mathcal{X}$.

* 4.16 EXERCISE. Call a set $B \subset \mathcal{X}^\infty$ invariant under the shift $\phi: \mathcal{X}^\infty \to \mathcal{X}^\infty$ if $B = \phi B$.
Call it almost invariant relative to a measure $P^X$ if $P^X(B \,\triangle\, \phi B) = 0$. Show that a set $B$
is almost invariant if and only if there exists an invariant set $\tilde B$ such that $P^X(B \,\triangle\, \tilde B) = 0$.
[Try $\tilde B = \cap_t \phi^t B$.]

* 4.17 EXERCISE. Define the invariant $\sigma$-field $\mathcal{B}_{inv}$ on $\mathcal{X}^\infty$ as the collection of measurable
sets that are invariant under the shift operation, and let $\overline{\mathcal{B}}_{inv}$ be its completion under
the measure $P^X$. Show that $X^{-1}(\mathcal{B}_{inv}) \subset \mathcal{U}_{inv} \subset \overline{X^{-1}(\mathcal{B}_{inv})}$, where the long bar on the
right denotes completion relative to $P$. [Note that $\{X \in B\} = \{X \in \phi B\}$ implies that
$P^X(B \,\triangle\, \phi B) = 0$. Use the preceding exercise to replace $B$ by an invariant set $\tilde B$.]

4.18 Theorem (Birkhoff). If $X_t$ is a strictly stationary time series with $E|X_t| < \infty$,
then $\overline{X}_n \to E(X_0 \,|\, \mathcal{U}_{inv})$ almost surely and in mean.

Proof. For a given $\alpha \in \mathbb{R}$ define a set $B = \{x \in \mathbb{R}^\infty: \limsup_{n\to\infty} \overline{x}_n > \alpha\}$. Because

$$\overline{x}_{n+1} = \frac{x_1}{n+1} + \frac{n}{n+1}\,\frac1n\sum_{t=2}^{n+1} x_t,$$

a point $x$ is contained in $B$ if and only if $\limsup n^{-1}\sum_{t=2}^{n+1} x_t > \alpha$. Equivalently, $x \in B$ if
and only if $\phi x \in B$. Thus the set $B$ is invariant under the shift operation $\phi: \mathbb{R}^\infty \to \mathbb{R}^\infty$.
We conclude from this that the variable $\limsup_{n\to\infty} \overline{X}_n$ is measurable relative to the
invariant $\sigma$-field.

Fix some measurable set $B \subset \mathbb{R}^\infty$. For every invariant set $A \in \mathcal{U}_{inv}$ there exists a
measurable set $C \subset \mathbb{R}^\infty$ such that $A = \{\phi^t X \in C\}$ for every $t$. By the strict stationarity
of $X$,

$$P\bigl(\{\phi^t X \in B\} \cap A\bigr) = P(\phi^t X \in B,\ \phi^t X \in C) = P(X \in B,\ X \in C) = P\bigl(\{X \in B\} \cap A\bigr).$$

This shows that $P(\phi^t X \in B \,|\, \mathcal{U}_{inv}) = P(X \in B \,|\, \mathcal{U}_{inv})$ almost surely. We conclude that
the conditional laws of $\phi^t X$ and $X$ given the invariant $\sigma$-field are identical.

In particular, the conditional means $E(X_t \,|\, \mathcal{U}_{inv}) = E(X_1 \,|\, \mathcal{U}_{inv})$ are identical for
every $t$, almost surely. It also follows that a time series $Z_t$ of the type $Z_t = (X_t, R)$
for $R: \Omega \to \mathcal{R}$ a fixed $\mathcal{U}_{inv}$-measurable variable (for instance with values in $\mathcal{R} = \mathbb{R}^2$) is
strictly stationary, the conditional law of the first marginal of $Z$ (on $\mathcal{X}^\infty$) being strictly
stationary and the second marginal (on $\mathcal{R}^\infty$) being independent of $t$.

For the almost sure convergence of the sequence $\overline{X}_n$ it suffices to show that, for
every $\varepsilon > 0$, the event

$$A = \Bigl\{\limsup_{n\to\infty} \overline{X}_n > E(X_1 \,|\, \mathcal{U}_{inv}) + \varepsilon\Bigr\}$$

and a corresponding event for the lower tail have probability zero. By the preceding
the event $A$ is contained in the invariant $\sigma$-field. Furthermore, the time series
$Y_t = \bigl(X_t - E(X_1 \,|\, \mathcal{U}_{inv}) - \varepsilon\bigr)1_A$, being a fixed transformation of the time series
$Z_t = \bigl(X_t, E(X_1 \,|\, \mathcal{U}_{inv}), 1_A\bigr)$, is strictly stationary. We can write $A = \cup_n A_n$ for
$A_n = \cup_{t=1}^n \{\overline{Y}_t > 0\}$. Then $EY_1 1_{A_n} \to EY_1 1_A$ by the dominated convergence theorem,
in view of the assumption that $X_t$ is integrable. If we can show that $EY_1 1_{A_n} \ge 0$
for every $n$, then we can conclude that

$$0 \le EY_1 1_A = E\bigl(X_1 - E(X_1 \,|\, \mathcal{U}_{inv})\bigr)1_A - \varepsilon P(A) = -\varepsilon P(A),$$

because $A \in \mathcal{U}_{inv}$. This implies that $P(A) = 0$, concluding the proof of almost sure
convergence.

The $L_1$-convergence can next be proved by a truncation argument. We can first show,
more generally, but by an identical argument, that $n^{-1}\sum_{t=1}^n f(X_t) \to E\bigl(f(X_0) \,|\, \mathcal{U}_{inv}\bigr)$
almost surely, for every measurable function $f: \mathcal{X} \to \mathbb{R}$ with $E|f(X_t)| < \infty$. We can
apply this to the functions $f(x) = x1_{|x| \le M}$ for given $M$.

We complete the proof by showing that $EY_1 1_{A_n} \ge 0$ for every strictly stationary
time series $Y_t$ and every fixed $n$, and $A_n = \cup_{t=1}^n \{\overline{Y}_t > 0\}$. For every $2 \le j \le n$,

$$Y_1 + \cdots + Y_j \le Y_1 + \max(Y_2, Y_2 + Y_3, \ldots, Y_2 + \cdots + Y_{n+1}).$$

If we add the number 0 in the maximum on the right, then this is also true for $j = 1$.
We can rewrite the resulting $n$ inequalities as the single inequality

$$Y_1 \ge \max(Y_1, Y_1 + Y_2, \ldots, Y_1 + \cdots + Y_n) - \max(0, Y_2, Y_2 + Y_3, \ldots, Y_2 + \cdots + Y_{n+1}).$$

The event $A_n$ is precisely the event that the first of the two maxima on the right is
positive. Thus on this event the inequality remains true if we add also a zero to the first
maximum. It follows that $EY_1 1_{A_n}$ is bounded below by

$$E\bigl(\max(0, Y_1, Y_1 + Y_2, \ldots, Y_1 + \cdots + Y_n) - \max(0, Y_2, Y_2 + Y_3, \ldots, Y_2 + \cdots + Y_{n+1})\bigr)1_{A_n}.$$

Off the event $A_n$ the first maximum is zero, whereas the second maximum is always
nonnegative. Thus the expression does not increase if we cancel the indicator $1_{A_n}$. The
resulting expression is identically zero by the strict stationarity of the series $Y_t$. ■

Thus a strong law is valid for every integrable strictly stationary sequence, without
any further conditions on possible dependence of the variables. However, the limit
$E(X_0 \,|\, \mathcal{U}_{inv})$ in the preceding theorem will often be a true random variable. Only if the
invariant $\sigma$-field is trivial, we can be sure that the limit is degenerate. Here "trivial" may
be taken to mean that the invariant $\sigma$-field consists of sets of probability 0 or 1 only. If
this is the case, then the time series $X_t$ is called ergodic.
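
A simple non-ergodic example may help to see why the limit can be random. The sketch below simulates $X_t = R + Z_t$, where $R$ is drawn once as $\pm1$ with equal probability and $Z_t$ is i.i.d. standard normal noise. This series is strictly stationary with $EX_t = 0$, but $R$ is measurable relative to the invariant $\sigma$-field, so the sample mean settles at $R = E(X_0 \,|\, \mathcal{U}_{inv})$ rather than at 0. The model and the function name are our own illustrative choices.

```python
import random


def sample_mean_mixture(n, seed):
    """Simulate X_t = R + Z_t for t = 1..n and return (R, sample mean).

    R is +1 or -1 with probability 1/2 each and is drawn only once,
    so it is shared by every X_t; the Z_t are i.i.d. standard normal.
    """
    rng = random.Random(seed)
    r = rng.choice([-1.0, 1.0])
    total = 0.0
    for _ in range(n):
        total += r + rng.gauss(0.0, 1.0)
    return r, total / n


if __name__ == "__main__":
    for seed in range(5):
        r, mean = sample_mean_mixture(20000, seed)
        print(f"seed {seed}: R = {r:+.0f}, sample mean = {mean:+.4f}")
```

Across different seeds the sample means cluster near $+1$ or $-1$, never near the unconditional mean 0.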

* 4.19 EXERCISE. Suppose that $X_t$ is strictly stationary. Show that $X_t$ is ergodic if and
only if every sequence $Y_t = f(\ldots, X_{t-1}, X_t, X_{t+1}, \ldots)$ for a measurable map $f$ that is
integrable satisfies the law of large numbers $\overline{Y}_n \to EY_1$ almost surely. [Given an invariant
set $A = (\ldots, X_{-1}, X_0, X_1, \ldots)^{-1}(B)$ consider $Y_t = 1_B(\ldots, X_{t-1}, X_t, X_{t+1}, \ldots)$. Then
$\overline{Y}_n = 1_A$.]

Checking that the invariant $\sigma$-field is trivial may be a nontrivial operation. There
are other concepts that imply ergodicity and may be easier to verify. A time series $X_t$ is
called mixing if, for any measurable sets $A$ and $B$, as $h \to \infty$,

$$P\bigl((\ldots, X_{h-1}, X_h, X_{h+1}, \ldots) \in A,\ (\ldots, X_{-1}, X_0, X_1, \ldots) \in B\bigr) \to P\bigl((\ldots, X_{h-1}, X_h, X_{h+1}, \ldots) \in A\bigr)\,P\bigl((\ldots, X_{-1}, X_0, X_1, \ldots) \in B\bigr).$$

Every mixing time series is ergodic. This follows because if we take $A = B$ equal to an
invariant set, the preceding display reads $P^X(A) \to P^X(A)P^X(A)$, for $P^X$ the law of
the infinite series $X_t$, and hence $P^X(A)$ is 0 or 1.

The present type of mixing is related to the mixing concepts used to obtain central
limit theorems, and is weaker.

4.20 Theorem. Any strictly stationary $\alpha$-mixing time series is mixing.

Proof. For $t$-dimensional cylinder sets $A$ and $B$ in $\mathcal{X}^\infty$ (i.e. sets that depend on finitely
many coordinates only) the mixing condition becomes

$$P\bigl((X_h, \ldots, X_{t+h}) \in A,\ (X_0, \ldots, X_t) \in B\bigr) \to P\bigl((X_h, \ldots, X_{t+h}) \in A\bigr)\,P\bigl((X_0, \ldots, X_t) \in B\bigr).$$

For $h > t$ the absolute value of the difference of the two sides of the display is bounded
by $\alpha(h - t)$ and hence converges to zero as $h \to \infty$, for each fixed $t$.

Thus the mixing condition is satisfied by the collection of all cylinder sets. This
collection is intersection-stable, i.e. a $\pi$-system, and generates the product $\sigma$-field on
$\mathcal{X}^\infty$. The proof is complete if we can show that the collections of sets $A$ and $B$ for which
the mixing condition holds, for a given set $B$ or $A$, is a $\sigma$-field. By the $\pi$-$\lambda$ theorem it
suffices to show that these collections of sets are a $\lambda$-system.

The mixing property can be written as $P^X(\phi^{-h}A \cap B) - P^X(A)P^X(B) \to 0$, as
$h \to \infty$. Because $\phi$ is a bijection we have $\phi^{-h}(A_2 - A_1) = \phi^{-h}A_2 - \phi^{-h}A_1$. If $A_1 \subset A_2$,
then

$$P^X\bigl(\phi^{-h}(A_2 - A_1) \cap B\bigr) = P^X(\phi^{-h}A_2 \cap B) - P^X(\phi^{-h}A_1 \cap B),$$

$$P^X(A_2 - A_1)P^X(B) = P^X(A_2)P^X(B) - P^X(A_1)P^X(B).$$

If, for a given set $B$, the sets $A_1$ and $A_2$ satisfy the mixing condition, then the right
hand sides are asymptotically the same, as $h \to \infty$, and hence so are the left sides. Thus
$A_2 - A_1$ satisfies the mixing condition. If $A_n \uparrow A$, then $\phi^{-h}A_n \uparrow \phi^{-h}A$ as $n \to \infty$ and
hence

$$P^X(\phi^{-h}A_n \cap B) - P^X(A_n)P^X(B) \to P^X(\phi^{-h}A \cap B) - P^X(A)P^X(B).$$

The absolute difference of left and right sides is bounded above by $2|P^X(A_n) - P^X(A)|$.
Hence the convergence in the display is uniform in $h$. If each of the sets $A_n$ satisfies the
mixing condition, for a given set $B$, then so does $A$. Thus the collection of all sets $A$ that
satisfy the condition, for a given $B$, is a $\lambda$-system.

We can prove similarly, but more easily, that the collection of all sets $B$ is also a
$\lambda$-system. ■

4.21 Theorem. Any strictly stationary time series $X_t$ with trivial tail $\sigma$-field is mixing.

Proof. The tail $\sigma$-field is defined as $\cap_{h\in\mathbb{Z}}\,\sigma(X_h, X_{h+1}, \ldots)$.

As in the proof of the preceding theorem we need to verify the mixing condition
only for finite cylinder sets $A$ and $B$. We can write

$$\begin{aligned}
\bigl|E\,1_{X_h, \ldots, X_{t+h} \in A}\bigl(1_{X_0, \ldots, X_t \in B} - P(X_0, \ldots, X_t \in B)\bigr)\bigr|
&= \bigl|E\,1_{X_h, \ldots, X_{t+h} \in A}\bigl(P(X_0, \ldots, X_t \in B \,|\, X_h, X_{h+1}, \ldots) - P(X_0, \ldots, X_t \in B)\bigr)\bigr| \\
&\le E\bigl|P(X_0, \ldots, X_t \in B \,|\, X_h, X_{h+1}, \ldots) - P(X_0, \ldots, X_t \in B)\bigr|.
\end{aligned}$$

For every integrable variable $Y$ the sequence $E(Y \,|\, X_h, X_{h+1}, \ldots)$ converges in $L_1$ to the
conditional expectation of $Y$ given the tail $\sigma$-field, as $h \to \infty$. Because the tail $\sigma$-field
is trivial, in the present case this is $EY$. Thus the right side of the preceding display
converges to zero as $h \to \infty$. ■

4.22 EXERCISE. Show that a strictly stationary time series $X_t$ is ergodic if and only if
$n^{-1}\sum_{h=1}^n P^X(\phi^{-h}A \cap B) \to P^X(A)P^X(B)$, as $n \to \infty$, for all measurable subsets $A$
and $B$ of $\mathcal{X}^\infty$. [Use the ergodic theorem on the stationary time series $Y_t = 1_{\phi^t X \in A}$ for
the proof in one direction.]

* 4.23 EXERCISE. Show that a strictly stationary time series $X_t$ is ergodic if and only
if the one-sided time series $X_0, X_1, X_2, \ldots$ is ergodic, in the sense that the "one-sided
invariant $\sigma$-field", consisting of all sets $A$ such that $A = (X_t, X_{t+1}, \ldots)^{-1}(B)$ for some
measurable set $B$ and every $t \ge 0$, is trivial. [Use the preceding exercise.]

The preceding theorems can be used as starting points to construct ergodic sequences.
For instance, every i.i.d. sequence is ergodic by the preceding theorems, because
its tail $\sigma$-field is trivial by Kolmogorov's 0-1 law, or because it is $\alpha$-mixing. To
construct more examples we can combine the theorems with the following stability property.
From a given ergodic sequence $X_t$ we construct a process $Y_t$ by transforming the
vector $(\ldots, X_{t-1}, X_t, X_{t+1}, \ldots)$ with a given map $f$ from the product space $\mathcal{X}^\infty$ into
some measurable space $(\mathcal{Y}, \mathcal{B})$. As before, the $X_t$ in $(\ldots, X_{t-1}, X_t, X_{t+1}, \ldots)$ is meant
to be at a fixed zeroth position in $\mathbb{Z}$, so that the different variables $Y_t$ are obtained by
sliding the function $f$ along the sequence $(\ldots, X_{t-1}, X_t, X_{t+1}, \ldots)$.

4.24 Lemma. The sequence $Y_t = f(\ldots, X_{t-1}, X_t, X_{t+1}, \ldots)$ obtained by application of
a measurable map $f: \mathcal{X}^\infty \to \mathcal{Y}$ to an ergodic sequence $X_t$ is ergodic.

4.25 EXERCISE. Let $Z_t$ be an i.i.d. sequence of integrable variables and let
$X_t = \sum_j \psi_j Z_{t-j}$ for a sequence $\psi_j$ such that $\sum_j |\psi_j| < \infty$. Show that $X_t$ satisfies the law
of large numbers (with degenerate limit).

4.26 Example. Every stationary irreducible Markov chain on a countable state space
is ergodic. Conversely, a stationary reducible Markov chain on a countable state space
whose initial (or marginal) law is positive everywhere is nonergodic.

To prove the ergodicity note that a stationary irreducible Markov chain is (positively)
recurrent (e.g. Durrett, p266). If $A$ is an invariant set of the form
$A = (X_0, X_1, \ldots)^{-1}(B)$, then

$$\begin{aligned}
P\bigl((X_0, X_1, \ldots) \in B \,|\, X_h, X_{h-1}, \ldots\bigr) &= P\bigl((X_{h+1}, X_{h+2}, \ldots) \in B \,|\, X_h, X_{h-1}, \ldots\bigr) \\
&= P\bigl((X_{h+1}, X_{h+2}, \ldots) \in B \,|\, X_h\bigr).
\end{aligned}$$

We can write the right side as $g(X_h)$ for the function $g(x) = P(A \,|\, X_{-1} = x)$. By one
of the martingale convergence theorems the left side converges almost surely to $1_A$ as
$h \uparrow \infty$. By recurrence, for almost every $\omega$ in the underlying probability space, the right
side runs infinitely often through each of the numbers $g(x)$ with $x$ in the state space.
This implies that each of these numbers $g(x)$ must be either 0 or 1, and hence that $1_A$ is
either 0 or 1 with probability one. Thus every invariant set of this type is trivial, showing
the ergodicity of the one-sided sequence $X_0, X_1, \ldots$. It can be shown that one-sided and
two-sided ergodicity are the same.

Conversely, if the Markov chain is reducible, then the state space can be split into
two sets $\mathcal{X}_1$ and $\mathcal{X}_2$ such that the chain will remain in $\mathcal{X}_1$ or $\mathcal{X}_2$ once it enters there. If the
initial distribution puts positive mass everywhere, then each of the two possibilities occurs
with positive probability. The sets $A_i = \{X_0 \in \mathcal{X}_i\}$ are then invariant and nontrivial and
hence the chain is not ergodic.

It can also be shown that a stationary irreducible Markov chain is mixing if and only
if it is aperiodic. (See e.g. Durrett, p310.) Furthermore, the tail $\sigma$-field of any irreducible
stationary aperiodic Markov chain is trivial. (See e.g. Durrett, p279.) □
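
The contrast in Example 4.26 can be seen in a small simulation. The sketch below uses a two-state chain with hypothetical flip probabilities `p01` and `p10` (names and numerical values are our own choices). With `p01 = 0.1` and `p10 = 0.2` the chain is irreducible with stationary probability $2/3$ for state 0, and the fraction of time spent in state 0 approaches $2/3$. With `p01 = p10 = 0` the chain is reducible, every initial law is stationary, and the same fraction equals $1\{X_0 = 0\}$, a nondegenerate random limit.

```python
import random


def occupation_fraction(p01, p10, pi0, n, seed):
    """Fraction of the first n time points that a two-state chain spends in state 0.

    p01, p10: probabilities of jumping 0 -> 1 and 1 -> 0;
    pi0: probability that the initial state is 0.
    """
    rng = random.Random(seed)
    state = 0 if rng.random() < pi0 else 1
    visits = 0
    for _ in range(n):
        visits += (state == 0)
        flip = p01 if state == 0 else p10
        if rng.random() < flip:
            state = 1 - state
    return visits / n


if __name__ == "__main__":
    for seed in range(4):
        irreducible = occupation_fraction(0.1, 0.2, 2 / 3, 200000, seed)
        reducible = occupation_fraction(0.0, 0.0, 0.5, 200000, seed)
        print(f"seed {seed}: irreducible {irreducible:.4f}, reducible {reducible:.1f}")
```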

Ergodicity is a powerful, but somewhat complicated concept. If we are only interested
in a law of large numbers for a given sequence, then it may be advantageous to
use more elementary tools. For instance, the means $\overline{X}_n$ of any stationary time series $X_t$
converge in $L_1$ to a random variable; this limit is degenerate if and only if the spectral
mass of the series $X_t$ at zero is zero. See Example 6.30.



4.6 Martingale Differences 

The partial sums $\sum_{t=1}^n X_t$ of an i.i.d. sequence grow by increments $X_t$ that are independent
from the "past". The classical central limit theorem shows that this induces asymptotic
normality provided the increments are centered and not too big (finite variance
suffices). The mixing central limit theorems relax the independence to near independence
of variables at large time lags, which are conditions involving the whole distribution.
The martingale central limit theorem given in this section imposes conditions on the
conditional first and second moments of the increments given the past, without directly
involving other aspects of the distribution. The first moments given the past are assumed
zero; the second moments given the past must not be too big.

A filtration $\mathcal{F}_t$ is a nondecreasing collection of $\sigma$-fields $\cdots \subset \mathcal{F}_{-1} \subset \mathcal{F}_0 \subset \mathcal{F}_1 \subset \cdots$.
The $\sigma$-field $\mathcal{F}_t$ is to be thought of as the "events that are known" at time $t$. Often it will
be the $\sigma$-field generated by variables $X_t, X_{t-1}, X_{t-2}, \ldots$. The corresponding filtration is
called the natural filtration of the time series $X_t$, or the filtration generated by this series.
A martingale difference series relative to a given filtration is a time series $X_t$ such that,
for every $t$,
(i) $X_t$ is $\mathcal{F}_t$-measurable;
(ii) $E(X_t \,|\, \mathcal{F}_{t-1}) = 0$.

The second requirement implicitly includes the assumption that $E|X_t| < \infty$, so that the
conditional expectation is well-defined; the identity is understood to be in the almost-sure
sense.
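
As an illustration of how a martingale difference series can be far from independent, consider $X_t = Z_t Z_{t-1}$ for i.i.d. standard normal $Z_t$ (an example of our own choosing). Relative to $\mathcal{F}_t = \sigma(Z_t, Z_{t-1}, \ldots)$ we have $E(X_t \,|\, \mathcal{F}_{t-1}) = Z_{t-1}EZ_t = 0$, yet $X_t^2$ and $X_{t+1}^2$ share the factor $Z_t^2$ and are correlated. The sketch below compares the lag-1 sample autocorrelations of $X_t$ and of $X_t^2$; the helper names are hypothetical.

```python
import random


def lag1_autocorr(xs):
    """Lag-1 sample autocorrelation, using the divisor n in both covariances."""
    n = len(xs)
    mean = sum(xs) / n
    c0 = sum((x - mean) ** 2 for x in xs) / n
    c1 = sum((xs[t + 1] - mean) * (xs[t] - mean) for t in range(n - 1)) / n
    return c1 / c0


def simulate_product_series(n, seed):
    """Return X_1..X_n with X_t = Z_t * Z_{t-1} for i.i.d. standard normal Z_t."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(n + 1)]
    return [z[t + 1] * z[t] for t in range(n)]


if __name__ == "__main__":
    xs = simulate_product_series(50000, seed=1)
    print("lag-1 autocorrelation of X_t:  ", lag1_autocorr(xs))
    print("lag-1 autocorrelation of X_t^2:", lag1_autocorr([x * x for x in xs]))
```

The first value should be close to zero, whereas the second is clearly positive, so the increments are uncorrelated without being independent.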

4.27 EXERCISE. Show that a martingale difference series with finite variances is a white 
noise series. 

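4.28 Theorem. Let $X_t$ be a martingale difference series relative to the filtration
$\mathcal{F}_t$ such that $n^{-1}\sum_{t=1}^n E(X_t^2 \,|\, \mathcal{F}_{t-1}) \stackrel{P}{\to} v$ for a positive constant $v$, and such that
$n^{-1}\sum_{t=1}^n E\bigl(X_t^2 1\{|X_t| > \varepsilon\sqrt n\} \,|\, \mathcal{F}_{t-1}\bigr) \stackrel{P}{\to} 0$ for every $\varepsilon > 0$. Then $\sqrt n\,\overline{X}_n \rightsquigarrow N(0, v)$.

4.29 EXERCISE. Let $X_t$ be a strictly stationary, ergodic martingale difference series
relative to its natural filtration with mean zero and $v = EX_t^2 < \infty$. Show that
$\sqrt n\,\overline{X}_n \rightsquigarrow N(0, v)$.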



* 4.7 Projections 

Let $X_t$ be a centered time series and $\mathcal{F}_0 = \sigma(X_0, X_{-1}, \ldots)$. For a suitably mixing time
series the covariance $E\bigl(X_n E(X_j \,|\, \mathcal{F}_0)\bigr)$ between $X_n$ and the best prediction of $X_j$ at time
0 should be small as $n \to \infty$. The following theorem gives a precise and remarkably
simple sufficient condition for the central limit theorem in terms of these quantities.

4.30 Theorem. Let $X_t$ be a strictly stationary, mean zero, ergodic time series with
$\sum_h |\gamma_X(h)| < \infty$ and, as $n \to \infty$,

$$\sum_{j=0}^\infty \bigl|E\bigl(X_n E(X_j \,|\, \mathcal{F}_0)\bigr)\bigr| \to 0.$$

Then $\sqrt n\,\overline{X}_n \rightsquigarrow N(0, v)$, for $v = \sum_h \gamma_X(h)$.

Proof. For a fixed integer $m$ define a time series

$$Y_{t,m} = \sum_{j=t}^{t+m} \bigl(E(X_j \,|\, \mathcal{F}_t) - E(X_j \,|\, \mathcal{F}_{t-1})\bigr).$$

Then $Y_{t,m}$ is a strictly stationary martingale difference series. By the ergodicity of the
series $X_t$, for fixed $m$ as $n \to \infty$,

$$\frac1n\sum_{t=1}^n E(Y_{t,m}^2 \,|\, \mathcal{F}_{t-1}) \to EY_{0,m}^2 =: v_m,$$

almost surely and in mean. The number $v_m$ is finite, because the series $X_t$ is
square-integrable by assumption. By the martingale central limit theorem, Theorem 4.28, we
conclude that $\sqrt n\,\overline{Y}_{n,m} \rightsquigarrow N(0, v_m)$ as $n \to \infty$, for every fixed $m$.

Because $X_t = E(X_t \,|\, \mathcal{F}_t)$ we can write

$$\begin{aligned}
\sum_{t=1}^n (Y_{t,m} - X_t) &= \sum_{t=1}^n\sum_{j=t+1}^{t+m} E(X_j \,|\, \mathcal{F}_t) - \sum_{t=1}^n\sum_{j=t}^{t+m} E(X_j \,|\, \mathcal{F}_{t-1}) \\
&= \sum_{j=n+1}^{n+m} E(X_j \,|\, \mathcal{F}_n) - \sum_{j=1}^m E(X_j \,|\, \mathcal{F}_0) - \sum_{t=1}^n E(X_{t+m} \,|\, \mathcal{F}_{t-1}).
\end{aligned}$$

Write the right side as $Z_{n,m} - Z_{0,m} - R_{n,m}$. Then the time series $Z_{t,m}$ is stationary with

$$EZ_{0,m}^2 = \sum_{i=1}^m\sum_{j=1}^m E\bigl(E(X_i \,|\, \mathcal{F}_0)E(X_j \,|\, \mathcal{F}_0)\bigr) \le m^2 EX_0^2.$$

The right side divided by $n$ converges to zero as $n \to \infty$, for every fixed $m$. Furthermore,

$$\begin{aligned}
ER_{n,m}^2 &= \sum_{s=1}^n\sum_{t=1}^n E\bigl(E(X_{s+m} \,|\, \mathcal{F}_{s-1})E(X_{t+m} \,|\, \mathcal{F}_{t-1})\bigr) \\
&\le 2\mathop{\sum\sum}_{1\le s\le t\le n} \bigl|E\bigl(E(X_{s+m} \,|\, \mathcal{F}_{s-1})X_{t+m}\bigr)\bigr| \\
&\le 2n\sum_{h=1}^\infty \bigl|E\,E(X_{m+1} \,|\, \mathcal{F}_0)X_{h+m}\bigr| = 2n\sum_{h=m+1}^\infty \bigl|EX_{m+1}E(X_h \,|\, \mathcal{F}_0)\bigr|.
\end{aligned}$$

The right side divided by $n$ converges to zero as $m \to \infty$. Combining the three preceding
displays we see that the sequence $\sqrt n(\overline{Y}_{n,m} - \overline{X}_n) = (Z_{n,m} - Z_{0,m} - R_{n,m})/\sqrt n$ converges
to zero in second mean as $n \to \infty$ followed by $m \to \infty$.

Because $Y_{t,m}$ is a martingale difference series, the variables $Y_{t,m}$ are uncorrelated
and hence

$$\mathrm{var}\,\sqrt n\,\overline{Y}_{n,m} = EY_{0,m}^2 = v_m.$$

Because, as usual, $\mathrm{var}\,\sqrt n\,\overline{X}_n \to v$ as $n \to \infty$, combination with the preceding paragraph
shows that $v_m \to v$ as $m \to \infty$. Consequently, by Lemma 3.10 there exists $m_n \to \infty$
such that $\sqrt n\,\overline{Y}_{n,m_n} \rightsquigarrow N(0, v)$ and $\sqrt n(\overline{Y}_{n,m_n} - \overline{X}_n) \rightsquigarrow 0$. This implies the theorem in
view of Slutsky's lemma. ■

4.31 Example. We can use the preceding theorem for an alternative proof of the
$\alpha$-mixing central limit theorem, Theorem 4.7. The absolute convergence of the series
$\sum_h \gamma_X(h)$ can be verified under the condition of Theorem 4.7 as in the first lines of
the proof of that theorem. We concentrate on the verification of the displayed condition
of the preceding theorem. Set $Y_n = E(X_n \,|\, \mathcal{F}_0)$ and

$$1 - U_n = F_{|Y_n|}(|Y_n|-) + V\,\Delta F_{|Y_n|}(|Y_n|),$$

where $\Delta F$ denotes the jump sizes of a cumulative distribution function and $V$ is a uniform
variable independent of the other variables. The latter definition is an extended form of
the probability integral transformation, allowing for jumps in the distribution function.
The variable $U_n$ is uniformly distributed and $F^{-1}_{|Y_n|}(1 - U_n) = |Y_n|$ almost surely. Because
$Y_n$ is $\mathcal{F}_0$-measurable the covariance inequality, Lemma 4.11, gives, with $\alpha_j = \alpha(j)$,

$$\begin{aligned}
\bigl|E\bigl(E(X_n \,|\, \mathcal{F}_0)X_j\bigr)\bigr| &\le 2\int_0^{\alpha_j} F^{-1}_{|Y_n|}(1-u)\,F^{-1}_{|X_j|}(1-u)\,du \\
&= 2EY_n\,\mathrm{sign}(Y_n)\,F^{-1}_{|X_j|}(1 - U_n)\,1_{U_n<\alpha_j} \\
&= 2EX_n\,\mathrm{sign}(Y_n)\,F^{-1}_{|X_j|}(1 - U_n)\,1_{U_n<\alpha_j} \\
&\le 2E|X_n|\,F^{-1}_{|X_j|}(1 - U_n)\,1_{U_n<\alpha_j} \\
&\le 4\int_0^1 F^{-1}_{|X_n|}(1-u)\,G^{-1}(1-u)\,du,
\end{aligned}$$

by a second application of Lemma 4.11, with $\alpha = 1$ and $G$ the distribution function of the
random variable $F^{-1}_{|X_j|}(1 - U_n)1_{U_n<\alpha_j}$. The corresponding quantile function $G^{-1}(1-u)$
vanishes off $[0, \alpha_j]$ and is bounded above by the quantile function of $|X_j|$. Therefore, the
expression is further bounded by $4\int_0^{\alpha_j} F^{-1}_{|X_0|}(1-u)^2\,du$. We finish by summing up over $j$.
□



5 

Nonparametric Estimation of Mean and Covariance



Suppose we observe the values $X_1, \ldots, X_n$ from the stationary time series $X_t$ with mean
$\mu_X = E X_t$, covariance function $\gamma_X(h)$ and correlation function $\rho_X(h)$. If nothing is known
about the distribution of the time series, besides that it is stationary, then "obvious"
estimators for these parameters are

$$\hat\mu_n = \overline{X}_n = \frac{1}{n} \sum_{t=1}^n X_t,$$

$$\hat\gamma_n(h) = \frac{1}{n} \sum_{t=1}^{n-h} (X_{t+h} - \overline{X}_n)(X_t - \overline{X}_n), \qquad (0 \le h < n),$$

$$\hat\rho_n(h) = \frac{\hat\gamma_n(h)}{\hat\gamma_n(0)}.$$

In this chapter we study some of their properties.

These estimators are called nonparametric, because they are not motivated by a 
statistical model that restricts the distribution of the time series. The advantage is that 
they work for (almost) every stationary time series. However, given a statistical model, 
it might be possible to find better estimators for $\mu_X$, $\gamma_X$ and $\rho_X$. We shall see examples
of this when discussing ARMA-processes in a later chapter.
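
The three estimators can be sketched in a few lines of code. The following is a minimal
illustration, assuming the data are held in a one-dimensional NumPy array; the function
names are hypothetical and chosen only for this sketch. Note the factor $1/n$ (rather than
$1/(n-h)$) in the sample auto-covariance, as in the definition above.

```python
import numpy as np


def sample_autocovariance(x, h):
    """Sample auto-covariance at lag h, with the factor 1/n (0 <= h < n)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    if not 0 <= h < n:
        raise ValueError("lag h must satisfy 0 <= h < n")
    d = x - x.mean()  # deviations from the sample mean
    # sum over t = 1, ..., n-h of (X_{t+h} - mean)(X_t - mean), divided by n
    return float(np.dot(d[h:], d[: n - h]) / n)


def sample_autocorrelation(x, h):
    """Sample auto-correlation at lag h: gamma_hat(h) / gamma_hat(0)."""
    return sample_autocovariance(x, h) / sample_autocovariance(x, 0)
```

For instance, `sample_autocorrelation(x, 1)` gives the lag-one sample auto-correlation of
the observations in `x`.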

5.1 EXERCISE. The factor $1/n$ in the definition of $\hat\gamma_n(h)$ is sometimes replaced by
$1/(n-h)$, because there are $n-h$ terms in the sum. Show that with the present definition
of $\hat\gamma_n$ the corresponding estimate $\bigl(\hat\gamma_n(s-t)\bigr)_{s,t=1,\ldots,h}$ for the covariance matrix
of $(X_1, \ldots, X_h)$ is nonnegative-definite. Show by example that this is not true if we use
$1/(n-h)$. [Write the matrix as $QQ^T$ for a suitable $(n \times (2n))$ matrix $Q$.]

The time series $X_t$ is called Gaussian if the joint distribution of any finite number
of the variables $X_t$ is multivariate-normal. In that case $\overline{X}_n$ is normally distributed. The
distributions of $\hat\gamma_n(h)$ and $\hat\rho_n(h)$ are complicated, even under normality. Distributional
statements concerning these estimators are therefore usually asymptotic in nature, as
$n \to \infty$. In this chapter we discuss conditions under which each of the three estimators
is asymptotically normally distributed. This knowledge can be used to set approximate
confidence intervals.

Figure 5.1. Realization of the sample auto-correlation function ($n = 250$) of the stationary time series
satisfying $X_{t+1} = 0.5 X_t + Z_t$ for standard normal white noise $Z_t$.



5.1 Sample Mean 

The asymptotic normality of the sample mean is the subject of Chapter 4. In this sec- 
tion we discuss estimating its (asymptotic) variance, which is necessary to construct a 
confidence interval based on the sample mean. 

An approximate confidence interval for $\mu_X$ based on the sample mean typically takes
the form

$$\Bigl(\overline{X}_n - \frac{\hat\sigma_n}{\sqrt{n}}\,1.96,\ \overline{X}_n + \frac{\hat\sigma_n}{\sqrt{n}}\,1.96\Bigr).$$

If $\sqrt{n}(\overline{X}_n - \mu_X)/\hat\sigma_n \rightsquigarrow N(0, 1)$ as $n \to \infty$, then the confidence level of this interval
converges to 95%. The problem is to find suitable estimators $\hat\sigma_n$.

If the sequence $\sqrt{n}(\overline{X}_n - \mu_X)$ is asymptotically normal, as it is under the conditions
of the preceding chapter, the procedure works if the $\hat\sigma_n^2$ are consistent estimators of the
(asymptotic) variance of $\sqrt{n}\,\overline{X}_n$. Unlike in the case of independent, identically distributed
variables, the variance of the sample mean depends on characteristics of the joint
distribution of $(X_1, \ldots, X_n)$, rather than only on the marginal distributions. (See (4.1).)
The limiting variance $\sum_h \gamma_X(h)$ depends even on the joint distribution of the infinite
sequence $(X_1, X_2, \ldots)$. With a sufficient number of observations it is possible to estimate
the auto-covariances $\gamma_X(h)$ at smaller lags $h$, but, without further information, this is
not true for larger lags $h \approx n$ (let alone $h \ge n$), unless we make special assumptions.
Setting a confidence interval is therefore much harder than in the case of independent,
identically distributed variables.

If a reliable model is available, expressed in a vector of parameters, then the problem 
can be solved by a model-based estimator. We express the variance of the sample mean 
in these parameters, and next plug in estimates for these parameters. If there are not too 
many parameters in the model this should be feasible. (Methods to estimate parameters 
are discussed in later chapters.) 

5.2 EXERCISE.
(i) Calculate the asymptotic variance of the sample mean for the moving average
$X_t = Z_t + \theta Z_{t-1}$.
(ii) Same question for the stationary solution of $X_t = \phi X_{t-1} + Z_t$, where $|\phi| < 1$.

However, the use of a model-based estimator is at odds with the theme of this chap- 
ter: nonparametric estimation. It is possible to estimate the variance nonparametrically 
provided the time series is sufficiently mixing. We discuss several methods. 

A commonly used method is the method of batched means. The total set of
observations is split into $r$ blocks $[X_1, \ldots, X_l], [X_{l+1}, \ldots, X_{2l}], \ldots, [X_{(r-1)l+1}, \ldots, X_{rl}]$ of $l$
observations. (Assume that $n = rl$ for simplicity; drop a last batch of fewer than $l$
observations.) If $Y_1, \ldots, Y_r$ are the sample means of the $r$ blocks, then $\overline{Y}_r = \overline{X}_n$ and hence
$\operatorname{var} \overline{Y}_r = \operatorname{var} \overline{X}_n$. The hope is that we can ignore the dependence between $Y_1, \ldots, Y_r$ and
can simply estimate the variance $\operatorname{var}(\sqrt{r}\,\overline{Y}_r)$ by the sample variance $S^2_{r,Y}$ of $Y_1, \ldots, Y_r$.
If $l$ is "large enough" and the original series $X_t$ is sufficiently mixing, then this actually
works, to some extent.
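
The method of batched means can be sketched as follows. This is a minimal illustration
under stated assumptions: the function name is hypothetical, the sample variance is taken
with divisor $r - 1$ (the text does not fix the divisor), and the result is rescaled by $l$
so that it estimates the variance of $\sqrt{n}\,\overline{X}_n$, using $\operatorname{var}(\sqrt{n}\,\overline{X}_n) = l \operatorname{var}(\sqrt{r}\,\overline{Y}_r)$ when $n = rl$.

```python
import numpy as np


def batched_means_variance(x, l):
    """Estimate var(sqrt(n) * mean(x)) from r = n // l disjoint blocks of length l."""
    x = np.asarray(x, dtype=float)
    r = len(x) // l
    if r < 2:
        raise ValueError("need at least two complete blocks")
    # Drop a last incomplete batch, then average within each block.
    y = x[: r * l].reshape(r, l).mean(axis=1)
    # The sample variance of the block means estimates var(sqrt(r) * mean(y));
    # multiplying by l puts it on the scale of var(sqrt(n) * mean(x)).
    return float(l * y.var(ddof=1))
```

Dividing the returned value by $n$ and taking the square root gives a standard error for
the sample mean that could be plugged into the confidence interval above.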

Presumably, the method of batched means uses disjoint blocks of $X_t$ in order to
achieve the approximate independence of the block means $Y_1, \ldots, Y_r$ used for its
motivation. In general the block means are still dependent. This does not cause much
(additional) bias in the estimate of the variance, but it may have an effect on the
precision. It turns out that it is better to use all blocks of $l$ consecutive $X_t$, even
though these may be more dependent. Thus in our second method we consider all blocks
$[X_1, \ldots, X_l], [X_2, \ldots, X_{l+1}], \ldots, [X_{n-l+1}, \ldots, X_n]$ of $l$ consecutive observations. We let
$Z_1, Z_2, \ldots, Z_{n-l+1}$ be the sample means of the $n - l + 1$ blocks, and estimate the
variance of $\sqrt{n}\,\overline{X}_n$ by $l S^2_{n-l+1,Z}$. The following theorem shows that this method works under
some conditions, provided that $l$ is chosen dependent on $n$ with $l_n \to \infty$ at a not too fast
rate. Because in the theorem $l$ depends on $n$, so do the block means, and we denote them
by $Z_{n,1}, \ldots, Z_{n,n-l_n+1}$. The theorem considers both the sample variance of the block
means,

$$S^2_{n-l_n+1,Z} = \frac{1}{n - l_n + 1} \sum_{i=1}^{n-l_n+1} (Z_{n,i} - \overline{Z}_{n-l_n+1})^2,$$

and the centered empirical distribution function of the block means,

$$F_n(x) = \frac{1}{n - l_n + 1} \sum_{i=1}^{n-l_n+1} 1\bigl\{\sqrt{l_n}(Z_{n,i} - \overline{X}_n) \le x\bigr\}.$$

5.3 Theorem. Suppose that the time series $X_t$ is strictly stationary and $\alpha$-mixing with
mixing coefficients satisfying $\sum_h \alpha(h) < \infty$. Let $l_n \to \infty$ such that $l_n/n \to 0$. Furthermore,
suppose that $\sqrt{n}(\overline{X}_n - \mu_X) \rightsquigarrow N(0, v)$, for some number $v$. Then, for every $x$, the
sequence $F_n(x)$ converges in probability to $\Phi(x/\sqrt{v})$. Furthermore, if $v = \sum_h \gamma_X(h)$ and
$\sum_h |\gamma_X(h)| < \infty$, then the variance $l_n S^2_{n-l_n+1,Z}$ of $F_n$ converges in probability to $v$.

Proof. Let $G_n$ be the distribution function obtained by replacing the average $\overline{X}_n$ in the
definition of $F_n$ by $\mu_X$. These functions are related through $F_n(x) = G_n\bigl(x + \sqrt{l_n}(\overline{X}_n - \mu_X)\bigr)$.
The sequence $\sqrt{l_n}(\overline{X}_n - \mu_X)$ converges in probability to zero, by the assumptions
that the sequence $\sqrt{n}(\overline{X}_n - \mu_X)$ converges weakly and that $l_n/n \to 0$. In view of the
monotonicity of the functions $F_n$ and $G_n$ it suffices to show that $G_n(x) \to \Phi(x/\sqrt{v})$ in
probability, for every $x$.

Fix some $x$ and define $Y_t = 1\{\sqrt{l_n}(Z_{n,t} - \mu_X) \le x\}$. Then the time series $Y_t$ is
strictly stationary and $G_n(x) = \overline{Y}_{n-l_n+1}$. By assumption

$$E \overline{Y}_{n-l_n+1} = P\bigl(\sqrt{l_n}(\overline{X}_{l_n} - \mu_X) \le x\bigr) \to \Phi(x/\sqrt{v}).$$

Because the variable $Y_t$ depends only on the variables $X_s$ with $t \le s < t + l_n$, the series
$Y_t$ is $\alpha$-mixing with mixing coefficients bounded above by $\alpha(h - l_n)$ for $h > l_n$. Therefore,
by (4.2) followed by Lemma 4.11 (with $q = r = \infty$),

$$\operatorname{var} \overline{Y}_{n-l_n+1} \le \frac{1}{n - l_n + 1} \sum_h |\gamma_Y(h)| \le \frac{4}{n - l_n + 1} \Bigl(\sum_{h \ge l_n} \alpha(h - l_n) + l_n \tfrac{1}{2}\Bigr).$$

This converges to zero as $n \to \infty$. Thus $G_n(x) = \overline{Y}_{n-l_n+1} \to \Phi(x/\sqrt{v})$ in probability by
Chebyshev's inequality, and the first assertion of the theorem is proved.

To prove the convergence of the variance of $F_n$, we first note that the variances of
$F_n$ and $G_n$ are the same. Because $G_n \rightsquigarrow N(0, v)$, Theorem 3.8 shows that the variance
of $G_n$ converges to $v$ if and only if $\int_{|x| \ge M} x^2\,dG_n(x) \to 0$ in probability as $n \to \infty$
followed by $M \to \infty$. Now

$$\begin{aligned}
E \int_{|x| \ge M} x^2\,dG_n(x)
&= E \frac{1}{n - l_n + 1} \sum_{i=1}^{n-l_n+1} \bigl|\sqrt{l_n}(Z_{n,i} - \mu_X)\bigr|^2 1\bigl\{\sqrt{l_n}|Z_{n,i} - \mu_X| \ge M\bigr\} \\
&= E \bigl|\sqrt{l_n}(\overline{X}_{l_n} - \mu_X)\bigr|^2 1\bigl\{\sqrt{l_n}|\overline{X}_{l_n} - \mu_X| \ge M\bigr\}.
\end{aligned}$$

By assumption $\sqrt{l_n}(\overline{X}_{l_n} - \mu_X) \rightsquigarrow N(0, v)$, while $E|\sqrt{l_n}(\overline{X}_{l_n} - \mu_X)|^2 \to v$ by (4.1). Thus
we can apply Theorem 3.8 in the other direction to conclude that the right side of the
display converges to zero as $n \to \infty$ followed by $M \to \infty$. ■

The usefulness of the estimate $F_n$ goes beyond its variance. For instance, we could
use the quantiles of $F_n$ as estimators of the quantiles of $\sqrt{n}(\overline{X}_n - \mu_X)$ and use these to
replace the normal quantiles and $\hat\sigma_n$ in the construction of a confidence interval. This
gives the interval

$$\Bigl[\overline{X}_n - \frac{F_n^{-1}(0.975)}{\sqrt{n}},\ \overline{X}_n - \frac{F_n^{-1}(0.025)}{\sqrt{n}}\Bigr].$$

The preceding theorem shows that this interval has asymptotic confidence level 95% for
covering $\mu_X$.
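
A minimal sketch of the second method, using all $n - l + 1$ overlapping blocks, is given
below. The function names are hypothetical, and `np.quantile` (which interpolates linearly
between order statistics) is used as a stand-in for the exact quantile function $F_n^{-1}$;
both choices are assumptions of this illustration, not part of the method as stated.

```python
import numpy as np


def overlapping_block_means(x, l):
    """Means of all n - l + 1 blocks of l consecutive observations."""
    x = np.asarray(x, dtype=float)
    c = np.concatenate(([0.0], np.cumsum(x)))
    return (c[l:] - c[:-l]) / l


def block_variance_estimate(x, l):
    """l times the variance (divisor n - l + 1) of the overlapping block means."""
    z = overlapping_block_means(x, l)
    return float(l * np.mean((z - z.mean()) ** 2))


def block_quantile_interval(x, l, level=0.95):
    """Interval for the mean based on quantiles of sqrt(l) * (Z_i - mean(x))."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    centered = np.sqrt(l) * (overlapping_block_means(x, l) - x.mean())
    a = (1.0 - level) / 2.0
    q_lo, q_hi = np.quantile(centered, [a, 1.0 - a])
    # The upper quantile determines the lower endpoint and vice versa.
    return x.mean() - q_hi / np.sqrt(n), x.mean() - q_lo / np.sqrt(n)
```

In use, the block length `l` should grow with `n` but much more slowly, in line with the
condition $l_n \to \infty$, $l_n/n \to 0$ of the theorem.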

Another, related method is the blockwise bootstrap. Assume that $n = lr$ for simplicity.
Given the same blocks $[X_1, \ldots, X_l], [X_2, \ldots, X_{l+1}], \ldots, [X_{n-l+1}, \ldots, X_n]$, we choose
$r = n/l$ blocks at random with replacement and put the $r$ blocks in a row, in any order,
but preserving the order of the $X_t$ within the $r$ blocks. We denote the row of $n = rl$
variables obtained in this way by $X_1^*, X_2^*, \ldots, X_n^*$ and let $\overline{X}_n^*$ be their average. The bootstrap
estimate of the distribution of $\sqrt{n}(\overline{X}_n - \mu_X)$ is by definition the conditional
distribution of $\sqrt{n}(\overline{X}_n^* - \overline{X}_n)$ given $X_1, \ldots, X_n$. The corresponding estimate of the variance of
$\sqrt{n}(\overline{X}_n - \mu_X)$ is the variance of this conditional distribution.

Another, but equivalent, description of the bootstrap procedure is to choose a random
sample with replacement from the block averages $Z_{n,1}, \ldots, Z_{n,n-l_n+1}$. If this sample
is denoted by $Z_1^*, \ldots, Z_r^*$, then the average $\overline{X}_n^*$ is also the average $\overline{Z}_r^*$. It follows that the
bootstrap estimate of the variance of $\sqrt{n}\,\overline{X}_n$ is $n$ times the conditional variance of the mean of a
random sample of size $r$ from the block averages, given the values $Z_{n,1}, \ldots, Z_{n,n-l_n+1}$ of
these averages. This is simply $(n/r) S^2_{n-l_n+1,Z} = l S^2_{n-l_n+1,Z}$, as before.

Other aspects of the bootstrap estimators of the distribution, for instance quantiles, 
are hard to calculate explicitly. In practice we perform computer simulation to obtain 
an approximation of the bootstrap estimate. By repeating the sampling procedure a 
large number of times (with the same values of $X_1, \ldots, X_n$), and taking the empirical
distribution over the realizations, we can, in principle, obtain arbitrary precision.
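
The simulation step can be sketched as follows, using the equivalent description in terms
of resampled block averages. This is an illustration under assumptions: the function names,
the default number of replications and the seeded NumPy generator are all choices made for
this sketch, and a last incomplete batch is ignored when $n$ is not a multiple of $l$.

```python
import numpy as np


def overlapping_block_means(x, l):
    """Means of all n - l + 1 blocks of l consecutive observations."""
    x = np.asarray(x, dtype=float)
    c = np.concatenate(([0.0], np.cumsum(x)))
    return (c[l:] - c[:-l]) / l


def block_bootstrap_draws(x, l, n_boot=2000, seed=0):
    """Simulated draws of sqrt(n) * (bootstrap mean - sample mean)."""
    x = np.asarray(x, dtype=float)
    r = len(x) // l                      # number of blocks per bootstrap sample
    z = overlapping_block_means(x, l)
    rng = np.random.default_rng(seed)
    # Each row holds the indices of r block averages drawn with replacement.
    idx = rng.integers(0, len(z), size=(n_boot, r))
    xbar_star = z[idx].mean(axis=1)      # bootstrap mean = mean of drawn block means
    return np.sqrt(r * l) * (xbar_star - x.mean())
```

Empirical quantiles or the empirical variance of the returned draws then approximate the
corresponding features of the bootstrap distribution; the variance should be close to
$l S^2_{n-l+1,Z}$ when the number of replications is large.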

All three methods discussed previously are based on forming blocks of a certain
length $l$. The proper choice of the block length is crucial for their success: the preceding
theorem shows that (one of) the estimators will be consistent provided $l_n \to \infty$ such that
$l_n/n \to 0$. Additional calculations show that, under general conditions, the variances of
the variance estimators are minimal if $l_n$ is proportional to $n^{1/3}$.¹

5.4 EXERCISE. Extend the preceding theorem to the method of batched means. Show 
that the variance estimator is consistent. 



¹ See Künsch (1989), Annals of Statistics 17, p. 1217-1241.

5.2 Sample Auto Covariances

Replacing a given time series $X_t$ by the centered time series $X_t - \mu_X$ does not change
the auto-covariance function. Therefore, for the study of the asymptotic properties of
the sample auto-covariance function $\hat\gamma_n(h)$, it is not a loss of generality to assume that
$\mu_X = 0$. The sample auto-covariance function can be written as

$$\hat\gamma_n(h) = \frac{1}{n} \sum_{t=1}^{n-h} X_{t+h} X_t - \overline{X}_n \Bigl(\frac{1}{n} \sum_{t=1}^{n-h} X_t\Bigr) - \Bigl(\frac{1}{n} \sum_{t=h+1}^{n} X_t\Bigr) \overline{X}_n + \frac{n-h}{n} (\overline{X}_n)^2.$$

Under the conditions of Chapter 4 and the assumption $\mu_X = 0$, the sample mean $\overline{X}_n$
is of the order $O_P(1/\sqrt{n})$ and hence the last term on the right is of the order $O_P(1/n)$.
For fixed $h$ the second and third term are almost equivalent to $(\overline{X}_n)^2$ and are also of the
order $O_P(1/n)$. Thus, under the assumption that $\mu_X = 0$,

$$\hat\gamma_n(h) = \frac{1}{n} \sum_{t=1}^{n-h} X_{t+h} X_t + O_P\Bigl(\frac{1}{n}\Bigr).$$



It follows from this and Slutsky's lemma that the asymptotic behaviour of the sequence
$\sqrt{n}(\hat\gamma_n(h) - \gamma_X(h))$ depends only on $n^{-1} \sum_{t=1}^{n-h} X_{t+h} X_t$. Here a change of $n$ by $n - h$
(or $n - h$ by $n$) is asymptotically negligible, so that, for simplicity of notation, we can
equivalently study the averages

$$\hat\gamma_n^*(h) = \frac{1}{n} \sum_{t=1}^{n} X_{t+h} X_t.$$

These are unbiased estimators of $E X_{t+h} X_t = \gamma_X(h)$, under the condition that $\mu_X = 0$.
Their asymptotic distribution can be derived by applying a central limit theorem to the
averages $\overline{Y}_n$ of the variables $Y_t = X_{t+h} X_t$.

If the time series $X_t$ is mixing with mixing coefficients $\alpha(k)$, then the time series $Y_t$
is mixing with mixing coefficients bounded above by $\alpha(k - h)$ for $k > h \ge 0$. Because
the conditions for a central limit theorem depend only on the speed at which the mixing
coefficients converge to zero, this means that in most cases the mixing coefficients of
$X_t$ and $Y_t$ are equivalent. By the Cauchy-Schwarz inequality the series $Y_t$ has finite
moments of order $k$ if the series $X_t$ has finite moments of order $2k$. This means that the
mixing central limit theorems for the sample mean apply without further difficulties to
proving the asymptotic normality of the sample auto-covariance function. The asymptotic
variance takes the form $\sum_g \gamma_Y(g)$ and in general depends on fourth order moments of
the type $E X_{t+g+h} X_{t+g} X_{t+h} X_t$ as well as on the auto-covariance function of the series
$X_t$. In its generality, its precise form is not of much interest.

5.5 Theorem. If $X_t$ is a strictly stationary, mixing time series with $\alpha$-mixing coefficients
such that $\int_0^1 \alpha^{-1}(u) F^{-1}_{|X_h X_0|}(1-u)^2\,du < \infty$, then the sequence $\sqrt{n}(\hat\gamma_n(h) - \gamma_X(h))$
converges in distribution to a normal distribution.

Another approach to central limit theorems is special to linear processes, of the form

$$(5.1) \qquad X_t = \mu + \sum_{j=-\infty}^{\infty} \psi_j Z_{t-j}.$$

Here we assume that $\ldots, Z_{-1}, Z_0, Z_1, Z_2, \ldots$ is a sequence of independent and identically
distributed variables with $E Z_t = 0$, and that the constants $\psi_j$ satisfy $\sum_j |\psi_j| < \infty$. The
sample auto-covariance function of a linear process is also asymptotically normal, but
the proof of this requires additional work. This work is worthwhile mainly because the
limit variance takes a simple form in this case.

Under (5.1) with $\mu = 0$, the auto-covariance function of the series $Y_t = X_{t+h} X_t$ can
be calculated as

$$\begin{aligned}
\gamma_Y(g) &= \operatorname{cov}(X_{t+g+h} X_{t+g}, X_{t+h} X_t) \\
&= \sum_i \sum_j \sum_k \sum_l \psi_{t-i} \psi_{t+h-j} \psi_{t+g-k} \psi_{t+g+h-l} \operatorname{cov}(Z_i Z_j, Z_k Z_l).
\end{aligned}$$

Here $\operatorname{cov}(Z_i Z_j, Z_k Z_l)$ is zero whenever one of the indices $i, j, k, l$ occurs only once. For
instance $E Z_1 Z_2 Z_{10} Z_2 = E Z_1\, E Z_2^2\, E Z_{10} = 0$. It also vanishes if $i = j \ne k = l$. The
covariance is nonzero only if all four indices are the same, or if the indices occur in the
pairs $i = k \ne j = l$ or $i = l \ne j = k$. Thus the preceding display can be rewritten as

$$\begin{aligned}
&\operatorname{cov}(Z_1^2, Z_1^2) \sum_i \psi_{t-i} \psi_{t+h-i} \psi_{t+g-i} \psi_{t+g+h-i} \\
&\qquad + \operatorname{cov}(Z_1 Z_2, Z_1 Z_2) \mathop{\sum\sum}_{i \ne j} \psi_{t-i} \psi_{t+h-j} \psi_{t+g-i} \psi_{t+g+h-j} \\
&\qquad + \operatorname{cov}(Z_1 Z_2, Z_2 Z_1) \mathop{\sum\sum}_{i \ne j} \psi_{t-i} \psi_{t+h-j} \psi_{t+g-j} \psi_{t+g+h-i} \\
&= \bigl(E Z_1^4 - 3 (E Z_1^2)^2\bigr) \sum_i \psi_i \psi_{i+h} \psi_{i+g} \psi_{i+g+h} + \gamma_X(g)^2 + \gamma_X(g+h) \gamma_X(g-h).
\end{aligned}$$

In the last step we use Lemma 1.27(iii) twice, after first adding in the diagonal terms
$i = j$ into the double sums. Since $\operatorname{cov}(Z_1 Z_2, Z_1 Z_2) = (E Z_1^2)^2$, these diagonal terms
account for $-2$ of the $-3$ times the sum in the first term. The variance of $\hat\gamma_n^*(h) = \overline{Y}_n$
converges to the sum over $g$ of this expression. With $\kappa_4(Z) = E Z_1^4/(E Z_1^2)^2 - 3$, the fourth
cumulant (or kurtosis) of $Z_t$, this sum can be written as

$$V_{h,h} = \kappa_4(Z) \gamma_X(h)^2 + \sum_g \gamma_X(g)^2 + \sum_g \gamma_X(g+h) \gamma_X(g-h).$$
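
As a quick sanity check of this formula (a worked special case added for illustration,
not part of the derivation above), take $\psi_0 = 1$ and $\psi_j = 0$ for $j \ne 0$, so that
$X_t = Z_t$, and take $h = 0$. Then $\gamma_X(g) = 0$ for $g \ne 0$ and $\gamma_X(0) = E Z_1^2$, so that

$$V_{0,0} = \kappa_4(Z) \gamma_X(0)^2 + 2 \gamma_X(0)^2 = \bigl(E Z_1^4 - 3 (E Z_1^2)^2\bigr) + 2 (E Z_1^2)^2 = E Z_1^4 - (E Z_1^2)^2 = \operatorname{var} Z_1^2.$$

This agrees with the ordinary central limit theorem applied to the average
$n^{-1} \sum_{t=1}^n Z_t^2$ of the independent variables $Z_t^2$.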



5.6 Theorem. Suppose that (5.1) holds for an i.i.d. sequence $Z_t$ with mean zero and
$E Z_t^4 < \infty$ and numbers $\psi_j$ with $\sum_j |\psi_j| < \infty$. Then $\sqrt{n}(\hat\gamma_n(h) - \gamma_X(h)) \rightsquigarrow N(0, V_{h,h})$.

Proof. As explained in the discussion preceding the statement of the theorem, it suffices
to show that the sequence $\sqrt{n}(\hat\gamma_n^*(h) - \gamma_X(h))$ has the given asymptotic distribution in
the case that $\mu = 0$. Define $Y_t = X_{t+h} X_t$ and, for fixed $m \in \mathbb{N}$,

$$Y_t^m = \sum_{|i| \le m} \psi_i Z_{t+h-i} \sum_{|j| \le m} \psi_j Z_{t-j} = X_{t+h}^m X_t^m.$$

The time series $Y_t^m$ is $(2m + h + 1)$-dependent and strictly stationary. By Theorem 4.4
the sequence $\sqrt{n}(\overline{Y}_n^m - E \overline{Y}_n^m)$ is asymptotically normal with mean zero and variance

$$\sigma_m^2 = \sum_g \gamma_{Y^m}(g) = \kappa_4(Z) \gamma_{X^m}(h)^2 + \sum_g \gamma_{X^m}(g)^2 + \sum_g \gamma_{X^m}(g+h) \gamma_{X^m}(g-h),$$

where the second equality follows from the calculations preceding the theorem. For every
$g$, as $m \to \infty$,

$$\gamma_{X^m}(g) = E Z_1^2 \sum_{j: |j| \le m, |j+g| \le m} \psi_j \psi_{j+g} \to E Z_1^2 \sum_j \psi_j \psi_{j+g} = \gamma_X(g).$$

Furthermore, the numbers on the left are bounded above by $E Z_1^2 \sum_j |\psi_j \psi_{j+g}|$, and

$$\sum_g \Bigl(\sum_j |\psi_j \psi_{j+g}|\Bigr)^2 = \sum_g \sum_i \sum_k |\psi_i \psi_k \psi_{i+g} \psi_{k+g}| \le \sup_j |\psi_j| \Bigl(\sum_j |\psi_j|\Bigr)^3 < \infty.$$

Therefore, by the dominated convergence theorem $\sum_g \gamma_{X^m}(g)^2 \to \sum_g \gamma_X(g)^2$ as $m \to \infty$.
By a similar argument, we obtain the corresponding property for the third term in the
expression defining $\sigma_m^2$, whence $\sigma_m^2 \to V_{h,h}$ as $m \to \infty$.

We conclude by Lemma 3.10 that there exists a sequence $m_n \to \infty$ such that
$\sqrt{n}(\overline{Y}_n^{m_n} - E \overline{Y}_n^{m_n}) \rightsquigarrow N(0, V_{h,h})$. The proof of the theorem is complete once we also have
shown that the difference between the sequences $\sqrt{n}(\overline{Y}_n - E \overline{Y}_n)$ and $\sqrt{n}(\overline{Y}_n^{m_n} - E \overline{Y}_n^{m_n})$
converges to zero in probability.

Both sequences are centered at mean zero. In view of Chebyshev's inequality it
suffices to show that $n \operatorname{var}(\overline{Y}_n - \overline{Y}_n^{m_n}) \to 0$. We can write

$$Y_t - Y_t^m = X_{t+h} X_t - X_{t+h}^m X_t^m = \sum_i \sum_j \psi^m_{t-i,t+h-j} Z_i Z_j,$$

where $\psi^m_{i,j} = \psi_i \psi_j$ if $|i| > m$ or $|j| > m$ and is $0$ otherwise. The variables $\overline{Y}_n - \overline{Y}_n^m$ are
the averages of these double sums and hence $n$ times their variance can be found as

$$\sum_{g=-n}^{n} \Bigl(\frac{n - |g|}{n}\Bigr) \gamma_{Y - Y^m}(g)
= \sum_{g=-n}^{n} \Bigl(\frac{n - |g|}{n}\Bigr) \sum_i \sum_j \sum_k \sum_l \psi^m_{t-i,t+h-j} \psi^m_{t+g-k,t+g+h-l} \operatorname{cov}(Z_i Z_j, Z_k Z_l).$$

Most terms in this five-fold sum are zero and by similar arguments as before the whole
expression can be bounded in absolute value by

$$\operatorname{cov}(Z_1^2, Z_1^2) \sum_g \sum_i |\psi^m_{i,i+h} \psi^m_{i+g,i+g+h}| + (E Z_1^2)^2 \sum_g \sum_i \sum_j |\psi^m_{i,j} \psi^m_{i+g,j+g}| + (E Z_1^2)^2 \sum_g \sum_i \sum_j |\psi^m_{i,j+h} \psi^m_{j+g,i+g+h}|.$$

We have that $\psi^m_{i,j} \to 0$ as $m \to \infty$ for every fixed $(i, j)$, $|\psi^m_{i,j}| \le |\psi_i \psi_j|$, and $\sup_i |\psi_i| < \infty$.
By the dominated convergence theorem the double and triple sums converge to zero as
well. ■

By similar arguments we can also prove the joint asymptotic normality of the sample
auto-covariances for a number of lags $h$ simultaneously. By the Cramér-Wold device a
sequence of $k$-dimensional random vectors $X_n$ converges in distribution to a random
vector $X$ if and only if $a^T X_n \rightsquigarrow a^T X$ for every $a \in \mathbb{R}^k$. A linear combination of sample
auto-covariances can be written as an average, as before. These averages can be shown to
be asymptotically normal by the same methods, with only the notation becoming more
complex.

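5.7 Theorem. Under the conditions of either Theorem 5.5 or 5.6, for every $h \in \mathbb{N}$ and
some $(h+1) \times (h+1)$-matrix $V$,

$$\sqrt{n} \left( \begin{pmatrix} \hat\gamma_n(0) \\ \vdots \\ \hat\gamma_n(h) \end{pmatrix} - \begin{pmatrix} \gamma_X(0) \\ \vdots \\ \gamma_X(h) \end{pmatrix} \right) \rightsquigarrow N_{h+1}(0, V).$$

For a linear process $X_t$ the matrix $V$ has $(g, h)$-element

$$V_{g,h} = \kappa_4(Z) \gamma_X(g) \gamma_X(h) + \sum_k \gamma_X(k+g) \gamma_X(k+h) + \sum_k \gamma_X(k-g) \gamma_X(k+h).$$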




5.3 Sample Auto Correlations 

The asymptotic distribution of the auto-correlations $\hat\rho_n(h)$ can be obtained from the
asymptotic distribution of the auto-covariance function by the Delta-method
(Theorem 3.14). We can write

$$\hat\rho_n(h) = \frac{\hat\gamma_n(h)}{\hat\gamma_n(0)} = \phi\bigl(\hat\gamma_n(0), \hat\gamma_n(h)\bigr),$$

for $\phi$ the function $\phi(u, v) = v/u$. This function has gradient $(-v/u^2, 1/u)$. By the
Delta-method,

$$\sqrt{n}\bigl(\hat\rho_n(h) - \rho_X(h)\bigr) = -\frac{\gamma_X(h)}{\gamma_X(0)^2} \sqrt{n}\bigl(\hat\gamma_n(0) - \gamma_X(0)\bigr) + \frac{1}{\gamma_X(0)} \sqrt{n}\bigl(\hat\gamma_n(h) - \gamma_X(h)\bigr) + o_P(1).$$

The limit distribution of the right side is the distribution of the random variable
$-\gamma_X(h)/\gamma_X(0)^2\, Y_0 + 1/\gamma_X(0)\, Y_h$ for $Y$ a random vector with the $N_{h+1}(0, V)$-distribution
given in Theorem 5.7. The joint limit distribution of a vector of auto-correlations is the
joint distribution of the corresponding linear combinations of the $Y_h$. By linearity this is
a Gaussian distribution; its mean is zero and its covariance matrix can be expressed in
the matrix $V$ by linear algebra.

5.8 Theorem. Under the conditions of either Theorem 5.5 or 5.6, for every $h \in \mathbb{N}$ and
some $h \times h$-matrix $W$,

$$\sqrt{n} \left( \begin{pmatrix} \hat\rho_n(1) \\ \vdots \\ \hat\rho_n(h) \end{pmatrix} - \begin{pmatrix} \rho_X(1) \\ \vdots \\ \rho_X(h) \end{pmatrix} \right) \rightsquigarrow N_h(0, W).$$

For a linear process $X_t$ the matrix $W$ has $(g, h)$-element

$$\begin{aligned}
W_{g,h} = \sum_k \bigl[ &\rho_X(k+g) \rho_X(k+h) + \rho_X(k-g) \rho_X(k+h) + 2 \rho_X(g) \rho_X(h) \rho_X(k)^2 \\
&- 2 \rho_X(g) \rho_X(k) \rho_X(k+h) - 2 \rho_X(h) \rho_X(k) \rho_X(k+g) \bigr].
\end{aligned}$$

The expression for the asymptotic covariance matrix $W$ in the case of a linear
process is known as Bartlett's formula. An interesting fact is that $W$ depends on the
auto-correlation function $\rho_X$ only, although $V$ depends also on the second and fourth
moments of $Z_1$. We discuss two interesting examples of this formula.

5.9 Example (Iid sequence). For $\psi_0 = 1$ and $\psi_j = 0$ for $j \ne 0$, the linear process $X_t$
given by (5.1) is equal to the i.i.d. sequence $\mu + Z_t$. Then $\rho_X(h) = 0$ for every $h \ne 0$ and
the matrix $W$ given by Bartlett's formula reduces to the identity matrix. This means that
for large $n$ the sample auto-correlations $\hat\rho_n(1), \ldots, \hat\rho_n(h)$ are approximately independent
normal variables with mean zero and variance $1/n$.

This can be used to test whether a given sequence of random variables is independent.
If the variables are independent and identically distributed, then approximately
95% of the computed auto-correlations should be in the interval $[-1.96/\sqrt{n}, 1.96/\sqrt{n}]$.
This is often verified graphically, from a plot of the auto-correlation function, on which
the given interval is indicated by two horizontal lines. Note that, just as we should expect
that 95% of the sample auto-correlations are inside the two bands in the plot, we should
also expect that 5% of them are not! A more formal test would be to compare the sum of
the squared sample auto-correlations to the appropriate chisquare table. The Ljung-Box
statistic is defined by

$$\sum_{h=1}^{k} \frac{n(n+2)}{n-h} \hat\rho_n(h)^2.$$

By the preceding theorem, for fixed $k$, this sequence of statistics tends to the $\chi^2$
distribution with $k$ degrees of freedom, as $n \to \infty$. (The coefficients $n(n+2)/(n-h)$
are motivated by a calculation of moments for finite $n$ and are thought to improve the
chisquare approximation, but are asymptotically equivalent to $n$.)

The more auto-correlations we use in a procedure of this type, the more information
we extract from the data and hence the better the result. However, the tests are based
on the asymptotic distribution of the sample auto-correlations and this was derived
under the assumption that the lag $h$ is fixed and $n \to \infty$. We should expect that the
convergence to normality is slower for sample auto-correlations $\hat\rho_n(h)$ of larger lags $h$,
since there are fewer terms in the sums defining them. Thus in practice we should not
use sample auto-correlations of lags that are large relative to $n$. □

Figure 5.2. Realization of the sample auto-correlation function of a Gaussian white noise series of length
250.
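
The Ljung-Box statistic of the preceding example is straightforward to compute. The sketch
below is a minimal illustration with a hypothetical function name; it only evaluates the
statistic, and the comparison with a chisquare quantile is left to the user (for instance,
the 95% quantile of the $\chi^2$ distribution with 10 degrees of freedom is approximately 18.31).

```python
import numpy as np


def ljung_box(x, k):
    """Ljung-Box statistic: sum over h = 1..k of n(n+2)/(n-h) * rho_hat(h)^2."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < n")
    d = x - x.mean()
    gamma0 = np.dot(d, d) / n            # sample auto-covariance at lag 0
    q = 0.0
    for h in range(1, k + 1):
        rho_h = np.dot(d[h:], d[: n - h]) / n / gamma0
        q += n * (n + 2) / (n - h) * rho_h ** 2
    return float(q)
```

A value of `ljung_box(x, k)` exceeding the chosen upper quantile of the $\chi^2$ distribution
with `k` degrees of freedom is evidence against the hypothesis of an i.i.d. sequence.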

5.10 Example (Moving average). For a moving average $X_t = Z_t + \theta_1 Z_{t-1} + \cdots + \theta_q Z_{t-q}$
of order $q$, the auto-correlations $\rho_X(h)$ of lags $h > q$ vanish. By the preceding theorem
the sequence $\sqrt{n}\,\hat\rho_n(h)$ converges for $h > q$ in distribution to a normal distribution with
variance

$$W_{h,h} = \sum_k \rho_X(k)^2 = 1 + 2 \rho_X(1)^2 + \cdots + 2 \rho_X(q)^2, \qquad h > q.$$

This can be used to test whether a moving average of a given order $q$ is an appropriate
model for a given observed time series. A plot of the auto-correlation function shows
nonzero auto-correlations for lags $1, \ldots, q$, and zero values for lags $h > q$. In practice we
plot the sample auto-correlation function. Just as in the preceding example, we should
expect that some sample auto-correlations of lags $h > q$ are significantly different from zero,
due to the estimation error. The asymptotic variances $W_{h,h}$ are bigger than 1 and hence
we should take the confidence bands a bit wider than the intervals $[-1.96/\sqrt{n}, 1.96/\sqrt{n}]$
as in the preceding example. A proper interpretation is more complicated, because the
sample auto-correlations are not asymptotically independent. □

Figure 5.3. Realization ($n = 250$) of the sample auto-correlation function of the moving average process
$X_t = 0.5 Z_t + 0.2 Z_{t-1} + 0.5 Z_{t-2}$ for a Gaussian white noise series $Z_t$.
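
The wider bands for a moving average can be computed from the formula for $W_{h,h}$. The
sketch below assumes the coefficients $\theta_1, \ldots, \theta_q$ are known (in practice they, or the
auto-correlations, would have to be estimated); the function names are hypothetical.

```python
import numpy as np


def ma_autocorrelations(theta):
    """Auto-correlations rho(0), ..., rho(q) of X_t = Z_t + theta_1 Z_{t-1} + ... + theta_q Z_{t-q}."""
    psi = np.concatenate(([1.0], np.asarray(theta, dtype=float)))
    q = len(psi) - 1
    # gamma(h) is proportional to sum_j psi_j psi_{j+h}; the noise variance cancels in rho.
    gamma = np.array([np.dot(psi[h:], psi[: q + 1 - h]) for h in range(q + 1)])
    return gamma / gamma[0]


def bartlett_ma_halfwidth(theta, n, z=1.96):
    """Half-width z * sqrt(W_hh / n) of the band for rho_hat(h) at lags h > q."""
    rho = ma_autocorrelations(theta)
    w = 1.0 + 2.0 * np.sum(rho[1:] ** 2)   # W_hh = 1 + 2 rho(1)^2 + ... + 2 rho(q)^2
    return float(z * np.sqrt(w / n))
```

For example, for $X_t = Z_t + Z_{t-1}$ we have $\rho_X(1) = 1/2$ and $W_{h,h} = 1.5$ for $h > 1$, so the
band is about 22% wider than in the i.i.d. case.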

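5.11 EXERCISE. Verify the formula for $W_{h,h}$ in the preceding example.

5.12 EXERCISE. Find $W_{1,1}$ as a function of $\theta$ for the process $X_t = Z_t + \theta Z_{t-1}$.

5.13 EXERCISE. Verify Bartlett's formula.

5.4 Sample Partial Auto Correlations

By Lemma 2.33 and the prediction equations the partial auto-correlation $\alpha_X(h)$ is the
solution $\phi_h$ of the system of equations

$$\begin{pmatrix} \gamma_X(0) & \gamma_X(1) & \cdots & \gamma_X(h-1) \\ \vdots & \vdots & & \vdots \\ \gamma_X(h-1) & \gamma_X(h-2) & \cdots & \gamma_X(0) \end{pmatrix} \begin{pmatrix} \phi_1 \\ \vdots \\ \phi_h \end{pmatrix} = \begin{pmatrix} \gamma_X(1) \\ \vdots \\ \gamma_X(h) \end{pmatrix}.$$

A nonparametric estimator $\hat\alpha_n(h)$ of $\alpha_X(h)$ is obtained by replacing the auto-covariance
function in this linear system by the sample auto-covariance function $\hat\gamma_n$. This yields
estimators $\hat\phi_1, \ldots, \hat\phi_h$ of the prediction coefficients satisfying

$$\begin{pmatrix} \hat\gamma_n(0) & \hat\gamma_n(1) & \cdots & \hat\gamma_n(h-1) \\ \vdots & \vdots & & \vdots \\ \hat\gamma_n(h-1) & \hat\gamma_n(h-2) & \cdots & \hat\gamma_n(0) \end{pmatrix} \begin{pmatrix} \hat\phi_1 \\ \vdots \\ \hat\phi_h \end{pmatrix} = \begin{pmatrix} \hat\gamma_n(1) \\ \vdots \\ \hat\gamma_n(h) \end{pmatrix}.$$

Then we define a nonparametric estimator for $\alpha_X(h)$ by $\hat\alpha_n(h) = \hat\phi_h$.

If we write these two systems of equations as $\Gamma \phi = \gamma$ and $\hat\Gamma \hat\phi = \hat\gamma$, respectively, then
we obtain that

$$\hat\phi - \phi = \hat\Gamma^{-1} \hat\gamma - \Gamma^{-1} \gamma = \hat\Gamma^{-1} (\hat\gamma - \gamma) - \hat\Gamma^{-1} (\hat\Gamma - \Gamma) \Gamma^{-1} \gamma.$$

The sequences $\sqrt{n}(\hat\gamma - \gamma)$ and $\sqrt{n}(\hat\Gamma - \Gamma)$ are asymptotically normal by Theorem 5.7.
With the help of Slutsky's lemma we readily obtain the asymptotic normality of the
sequence $\sqrt{n}(\hat\phi - \phi)$ and hence of the sequence $\sqrt{n}(\hat\alpha_n(h) - \alpha_X(h))$. The asymptotic
covariance matrix appears to be complicated, in general; we shall not derive it.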
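
The estimator $\hat\alpha_n(h)$ can be computed by solving the sample version of the linear
system directly. The sketch below is a minimal illustration with a hypothetical function
name; it solves the full $h \times h$ system with `numpy.linalg.solve` for each lag, which is
simple but not the most efficient approach for many lags.

```python
import numpy as np


def sample_pacf(x, h):
    """Sample partial auto-correlation at lag h >= 1: last coordinate of the
    solution of the system with sample auto-covariances."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    if not 1 <= h < n:
        raise ValueError("lag h must satisfy 1 <= h < n")
    d = x - x.mean()
    # Sample auto-covariances at lags 0, ..., h (factor 1/n).
    gam = np.array([np.dot(d[k:], d[: n - k]) / n for k in range(h + 1)])
    # Toeplitz matrix with (s, t)-element gamma_hat(|s - t|).
    lags = np.abs(np.subtract.outer(np.arange(h), np.arange(h)))
    phi = np.linalg.solve(gam[lags], gam[1:])
    return float(phi[-1])
```

At lag $h = 1$ the system reduces to a single equation, so that `sample_pacf(x, 1)` equals
the lag-one sample auto-correlation.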

5.14 Example (Auto regression). For the stationary solution to $X_t = \phi X_{t-1} + Z_t$ and
$|\phi| < 1$, the partial auto-correlations of lags $h \ge 2$ vanish, by Example 2.34. We shall
see later that in this case the sequence $\sqrt{n}\,\hat\alpha_n(h)$ is asymptotically standard normally
distributed, for every $h \ge 2$.

This result extends to the "causal" solution of the $p$th order auto-regressive scheme
$X_t = \phi_1 X_{t-1} + \cdots + \phi_p X_{t-p} + Z_t$ and the partial auto-correlations of lags $h > p$. (The meaning
of "causal" is explained in Chapter 7.) This property is used to find an appropriate order
$p$ when fitting an auto-regressive model to a given time series. The order is chosen such
that "most" of the sample partial auto-correlations of lags bigger than $p$ are within the band
$[-1.96/\sqrt{n}, 1.96/\sqrt{n}]$. □

Figure 5.4. Realization ($n = 250$) of the partial auto-correlation function of the stationary solution to
$X_t = 0.5 X_{t-1} + 0.2 X_{t-2} + Z_t$ for a Gaussian white noise series.



6 

Spectral Theory 



Let $X_t$ be a stationary, possibly complex, time series with auto-covariance function $\gamma_X$.
If the series $\sum_h |\gamma_X(h)|$ is convergent, then the series

$$f_X(\lambda) = \frac{1}{2\pi} \sum_{h \in \mathbb{Z}} \gamma_X(h) e^{-ih\lambda}$$

is absolutely convergent, uniformly in $\lambda \in \mathbb{R}$. This function is called the spectral density
of the time series $X_t$. Because it is periodic with period $2\pi$ it suffices to consider it on
an interval of length $2\pi$, which we shall take to be $(-\pi, \pi]$. In the present context the
values $\lambda$ in this interval are often referred to as frequencies, for reasons that will become
clear. By the uniform convergence, we can exchange the order of sum and integral when
computing $\int_{-\pi}^{\pi} e^{ih\lambda} f_X(\lambda)\,d\lambda$ and we find that, for every $h \in \mathbb{Z}$,

$$\gamma_X(h) = \int_{-\pi}^{\pi} e^{ih\lambda} f_X(\lambda)\,d\lambda.$$

Thus the spectral density $f_X$ determines the auto-covariance function, just as the auto-
covariance function determines the spectral density.
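
The pair of formulas can be checked numerically for a real-valued series with finitely many
nonzero auto-covariances. The sketch below is an illustration under assumptions: the function
names are hypothetical, the auto-covariances are passed as an array for lags $0, \ldots, H$ (with
$\gamma_X(-h) = \gamma_X(h)$), and the integral is approximated by an equally spaced Riemann sum over one
period, which is exact up to rounding for trigonometric polynomials of degree less than the
number of grid points.

```python
import numpy as np


def spectral_density(gamma, lam):
    """f(lam) = (1/2pi) * sum_h gamma(h) exp(-i h lam), for real gamma(h) = gamma(-h)
    given at lags 0..H (all other auto-covariances are taken to be zero)."""
    gamma = np.asarray(gamma, dtype=float)
    lam = np.atleast_1d(np.asarray(lam, dtype=float))
    h = np.arange(1, len(gamma))
    # For a symmetric real gamma the series reduces to a cosine sum.
    return (gamma[0] + 2.0 * np.cos(np.outer(lam, h)) @ gamma[1:]) / (2.0 * np.pi)


def autocovariance_from_density(gamma, h, m=4096):
    """Approximate the integral of exp(i h lam) f(lam) over one period of length 2pi."""
    lam = -np.pi + 2.0 * np.pi * np.arange(m) / m
    f = spectral_density(gamma, lam)
    return float(np.real(np.mean(np.exp(1j * h * lam) * f)) * 2.0 * np.pi)
```

For the moving average $X_t = Z_t + 0.5 Z_{t-1}$ with unit noise variance, `gamma = [1.25, 0.5]`,
and `autocovariance_from_density(gamma, h)` should return approximately 1.25, 0.5 and 0
for `h = 0, 1, 2`.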

6.1 EXERCISE. Prove this inversion formula, after first verifying that $\int_{-\pi}^{\pi} e^{ih\lambda}\,d\lambda = 0$
for integers $h \ne 0$ and $\int_{-\pi}^{\pi} e^{ih\lambda}\,d\lambda = 2\pi$ for $h = 0$.

In analysis the series $f_X$ is called a Fourier series and the numbers $\gamma_X(h)$ are called
the Fourier coefficients of $f_X$. (The factor $1/(2\pi)$ is sometimes omitted or replaced by
another number, and the Fourier series is often defined as $f_X(-\lambda)$ rather than $f_X(\lambda)$,
but this is inessential.) A main topic of Fourier analysis is to derive conditions under
which a Fourier series converges, in an appropriate sense, and to investigate whether the
inversion formula is valid. We have just answered these questions under the assumption
that $\sum_h |\gamma_X(h)| < \infty$. This condition is more restrictive than necessary, but is sufficient
for most of our purposes.

6.1 Spectral Measures

The requirement that the series $\sum_h \gamma_X(h)$ is absolutely convergent means roughly that
$\gamma_X(h) \to 0$ as $h \to \pm\infty$ at a "sufficiently fast" rate. In statistical terms it means that
variables $X_t$ that are widely separated in time must be approximately uncorrelated.
This is not true for every time series, and consequently not every time series possesses a
spectral density. However, every stationary time series does have a "spectral measure",
by the following theorem.

6.2 Theorem (Herglotz). For every stationary time series $X_t$ there exists a unique
finite measure $F_X$ on $(-\pi, \pi]$ such that

$$\gamma_X(h) = \int_{(-\pi,\pi]} e^{ih\lambda}\,dF_X(\lambda), \qquad h \in \mathbb{Z}.$$

Proof. Define F n as the measure on [— 7r, n] with Lebesgue density equal to 



It is not immediately clear that this is a real-valued, nonnegative function, but this 
follows from the fact that 



2im \^— ' / 2-im 

t=l s=l t=i 



It is clear from the definition of $f_n$ that the numbers $\gamma_X(h)(1 - |h|/n)$ are the Fourier
coefficients of $f_n$ for $|h| < n$ (and the remaining Fourier coefficients of $f_n$ are zero). Thus,
by the inversion formula,

$$\gamma_X(h) \Bigl(1 - \frac{|h|}{n}\Bigr) = \int_{-\pi}^{\pi} e^{ih\lambda} f_n(\lambda)\, d\lambda = \int_{[-\pi,\pi]} e^{ih\lambda}\, dF_n(\lambda), \qquad |h| < n.$$

Setting $h = 0$ in this equation, we see that $F_n[-\pi, \pi] = \gamma_X(0)$ for every $n$. Thus, apart
from multiplication by the constant $\gamma_X(0)$, the $F_n$ are probability distributions. Because
the interval $[-\pi, \pi]$ is compact, the sequence $F_n$ is uniformly tight. By Prohorov's theorem
there exists a subsequence $F_{n'}$ that converges weakly to a distribution $F$ on $[-\pi, \pi]$.
Because $\lambda \mapsto e^{ih\lambda}$ is a continuous function, it follows by the portmanteau lemma that

$$\int_{[-\pi,\pi]} e^{ih\lambda}\, dF(\lambda) = \lim_{n' \to \infty} \int_{[-\pi,\pi]} e^{ih\lambda}\, dF_{n'}(\lambda) = \gamma_X(h),$$

by the preceding display. If $F$ puts a positive mass at $-\pi$, we can move this to the point
$\pi$ without affecting this identity, since $e^{-ih\pi} = e^{ih\pi}$ for every $h \in \mathbb{Z}$. The resulting $F$
satisfies the requirements for $F_X$.

That this $F$ is unique can be proved using the fact that the linear span of the
functions $\lambda \mapsto e^{ih\lambda}$ is uniformly dense in the set of continuous, periodic functions (the
Cesàro sums of the Fourier series of a continuous, periodic function converge uniformly),
which, in turn, are dense in $L_1(F)$. We omit the details of this step, which is standard
Fourier analysis. ■
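The construction in the proof can be tried out numerically. The following is a minimal sketch, assuming the auto-covariance function of a moving average $X_t = Z_t + \theta Z_{t-1}$ with $\theta = 0.5$ and $\sigma^2 = 1$ (an illustrative choice that is not part of the proof; the function names are ours). It evaluates the density $f_n$ on a grid and approximates the total mass $F_n[-\pi, \pi]$, which should equal $\gamma_X(0)$.

```python
import cmath
import math


def gamma_ma1(h, theta=0.5, sigma2=1.0):
    """Auto-covariance of X_t = Z_t + theta * Z_{t-1} (illustrative assumption)."""
    if h == 0:
        return (1.0 + theta ** 2) * sigma2
    if abs(h) == 1:
        return theta * sigma2
    return 0.0


def f_n(lam, n, gamma=gamma_ma1):
    """Density of F_n: (1/2pi) sum_{|h|<n} gamma(h) (1 - |h|/n) e^{-i h lam}."""
    total = sum(
        gamma(h) * (1.0 - abs(h) / n) * cmath.exp(-1j * h * lam)
        for h in range(-n + 1, n)
    )
    return total / (2.0 * math.pi)


def total_mass(n, grid=2000):
    """Midpoint-rule approximation of F_n[-pi, pi]."""
    step = 2.0 * math.pi / grid
    return step * sum(
        f_n(-math.pi + (k + 0.5) * step, n).real for k in range(grid)
    )


n = 8
grid_values = [f_n(-math.pi + k * math.pi / 100, n) for k in range(201)]
print("smallest value of f_n on the grid:", min(v.real for v in grid_values))
print("largest imaginary part on the grid:", max(abs(v.imag) for v in grid_values))
print("total mass:", total_mass(n), " gamma_X(0):", gamma_ma1(0))
```

The grid check only illustrates nonnegativity; the proof obtains it for every $\lambda$ from the variance identity.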




The measure $F_X$ is called the spectral measure of the time series $X_t$. If the spectral
measure $F_X$ admits a density $f_X$ relative to the Lebesgue measure, then the latter is
called the spectral density. A sufficient condition for this is that the series $\sum_h \gamma_X(h)$ is
absolutely convergent. Then the spectral density is the Fourier series with coefficients
$\gamma_X(h)$ introduced previously.

6.3 EXERCISE. Show that the spectral density of a real-valued time series with
$\sum_h |\gamma_X(h)| < \infty$ is symmetric about zero.

* 6.4 EXERCISE. Show that the spectral measure of a real-valued time series is symmetric
about zero, apart from a possible point mass at $\pi$. (You need to know Fourier theory to
do this.)

* 6.5 EXERCISE. Show that every finite measure on $(-\pi, \pi]$ is the spectral measure of
some stationary time series.

6.6 Example (White noise). The covariance function of a white noise sequence $X_t$ is
0 for $h \ne 0$. Thus the Fourier series defining the spectral density has only one term and
reduces to

$$f_X(\lambda) = \frac{1}{2\pi} \gamma_X(0).$$

The spectral measure is a uniform measure. Hence "a white noise series contains all
possible frequencies in an equal amount". □

6.7 Example (Deterministic trigonometric series). Let $X_t = A\cos(\lambda t) + B\sin(\lambda t)$ for
mean-zero, uncorrelated variables $A$ and $B$ of variance $\sigma^2$, and $\lambda \in (0, \pi)$. By Example 1.5
the covariance function is given by

$$\gamma_X(h) = \sigma^2 \cos(h\lambda) = \sigma^2 \tfrac{1}{2} (e^{i\lambda h} + e^{-i\lambda h}).$$

It follows that the spectral measure $F_X$ is the discrete 2-point measure with $F_X\{\lambda\} =
F_X\{-\lambda\} = \sigma^2/2$.

Because the time series is real, the point mass at $-\lambda$ does not really count: because
the spectral measure of a real time series is symmetric, the point $-\lambda$ must be there
because $\lambda$ is there. The form of the spectral measure and the fact that the time series in
this example is a trigonometric series of frequency $\lambda$, is a good motivation for thinking
of the different values of $\lambda$ as "frequencies". □
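The covariance in Example 6.7 can be checked by an exact computation. This is a minimal sketch under an extra assumption made only for the illustration: $A$ and $B$ are independent and uniform on $\{-\sigma, +\sigma\}$, which makes them mean zero and uncorrelated with variance $\sigma^2$. The function names are ours. The expectation is computed by enumerating the four outcomes and compared with the integral of $e^{ih\mu}$ against the 2-point measure.

```python
import cmath
import math
from itertools import product


def cov_exact(h, t, lam, sigma):
    """Exact cov(X_{t+h}, X_t) for X_t = A cos(lam t) + B sin(lam t).

    Assumption: A and B are independent and uniform on {-sigma, +sigma}.
    """
    total = 0.0
    for a, b in product((-sigma, sigma), repeat=2):
        x_later = a * math.cos(lam * (t + h)) + b * math.sin(lam * (t + h))
        x_now = a * math.cos(lam * t) + b * math.sin(lam * t)
        total += x_later * x_now / 4.0  # each outcome has probability 1/4
    return total


def cov_from_spectral_measure(h, lam, sigma):
    """Integral of e^{i h mu} against the masses sigma^2/2 at +lam and -lam."""
    mass = sigma ** 2 / 2.0
    return mass * cmath.exp(1j * h * lam) + mass * cmath.exp(-1j * h * lam)


lam, sigma = 0.7, 1.5
for h in range(-3, 4):
    direct = cov_exact(h, t=5, lam=lam, sigma=sigma)
    spectral = cov_from_spectral_measure(h, lam, sigma)
    print(h, round(direct, 10), round(spectral.real, 10))
```

The value does not depend on the time point $t$, in agreement with stationarity.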

6.8 EXERCISE.
(i) Show that the spectral measure of the sum $X_t + Y_t$ of two uncorrelated time series
is the sum of the spectral measures of $X_t$ and $Y_t$.
(ii) Construct a time series with spectral measure equal to a symmetric discrete measure
on the points $\pm\lambda_1, \pm\lambda_2, \ldots, \pm\lambda_k$ with $0 < \lambda_1 < \cdots < \lambda_k < \pi$.
(iii) Construct a time series with spectral measure the 1-point measure with $F_X\{0\} = \sigma^2$.
(iv) Same question, but now with $F_X\{\pi\} = \sigma^2$.

[...] is a version of the spectral density.

The spectrum of a time series is an important theoretical concept, but it is also an
important practical tool to gain insight into periodicities in the data. Inference using the
spectrum is called a spectral analysis or an analysis in the frequency domain as opposed
to an "ordinary" analysis, which is in the time domain. We should not have too great
expectations of the insight offered by the spectrum. In some situations a spectral analysis
leads to clear-cut results, but in other situations the interpretation of the spectrum is
complicated, or even unclear.

The idea of a spectral analysis is to view the consecutive values

$$\ldots, X_{-1}, X_0, X_1, X_2, \ldots$$

as a random function, from $\mathbb{Z} \subset \mathbb{R}$ to $\mathbb{R}$, and to write this as a weighted sum (or
integral) of trigonometric functions $t \mapsto \cos \lambda t$ or $t \mapsto \sin \lambda t$ of different frequencies $\lambda$. In
simple cases finitely many frequencies suffice, whereas in other situations all frequencies
$\lambda \in (-\pi, \pi]$ are needed to give a full description, and the "weighted sum" becomes an
integral. Compare the examples of a deterministic trigonometric series (single frequency)
and white noise (all frequencies). The spectral measure gives the weights of the different
frequencies in the sum. Physicists would call a time series a signal and refer to the
spectrum as the weights at which the frequencies are present in the given signal.

We shall derive the spectral decomposition in Section 6.3. Another method to gain
insight into the interpretation of a spectrum is to consider the transformation of a spectrum
by filtering. The term "filtering" stems from the field of signal processing, where
a filter takes the form of an electronic device that filters out certain frequencies from a
given electric current. For us, a filter will remain an infinite moving average as defined
previously. For a given filter with filter coefficients $\psi_j$ the function $\psi(\lambda) = \sum_j \psi_j e^{-ij\lambda}$
is called the transfer function of the filter.

6.9 Theorem. Let $X_t$ be a stationary time series with spectral measure $F_X$ and let
$\sum_j |\psi_j| < \infty$. Then $Y_t = \sum_j \psi_j X_{t-j}$ has spectral measure $F_Y$ given by

$$dF_Y(\lambda) = |\psi(\lambda)|^2\, dF_X(\lambda).$$

Proof. According to Lemma 1.27(iii) (if necessary extended to complex-valued filters),
the series $Y_t$ is stationary with auto-covariance function

$$\gamma_Y(h) = \sum_k \sum_l \psi_k \overline{\psi_l}\, \gamma_X(h - k + l) = \sum_k \sum_l \psi_k \overline{\psi_l} \int e^{i(h-k+l)\lambda}\, dF_X(\lambda).$$

By the dominated convergence theorem we are allowed to change the order of (double)
summation and integration. Next we can rewrite the right side as $\int |\psi(\lambda)|^2 e^{ih\lambda}\, dF_X(\lambda)$.
This proves the theorem, in view of Theorem 6.2 and the uniqueness of the spectral
measure. ■

6.10 Example (Moving average). A white noise process $Z_t$ has a constant spectral
density $\sigma^2/(2\pi)$. By the preceding theorem the moving average $X_t = Z_t + \theta Z_{t-1}$ has
spectral density

$$f_X(\lambda) = |1 + \theta e^{-i\lambda}|^2 \frac{\sigma^2}{2\pi} = (1 + 2\theta\cos\lambda + \theta^2) \frac{\sigma^2}{2\pi}.$$

If $\theta > 0$, then the small frequencies dominate, whereas the bigger frequencies are more
important if $\theta < 0$. This suggests that the sample paths of this time series will be more
wiggly if $\theta < 0$. However, in both cases all frequencies are present in the signal. □
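The two routes to the spectral density of the moving average can be compared numerically. The sketch below, with function names of our own choosing, computes the density once through the transfer function as in Theorem 6.9 and once as the Fourier series of the auto-covariances $\gamma_X(0) = (1 + \theta^2)\sigma^2$ and $\gamma_X(\pm 1) = \theta\sigma^2$, and prints the values at the lowest and highest frequency for $\theta = \pm 0.5$.

```python
import cmath
import math


def transfer(psi, lam):
    """Transfer function sum_j psi_j e^{-i j lam} of a finite filter {j: psi_j}."""
    return sum(c * cmath.exp(-1j * j * lam) for j, c in psi.items())


def density_via_filter(lam, theta, sigma2):
    """|psi(lam)|^2 times the white noise density sigma^2 / (2 pi)."""
    psi = {0: 1.0, 1: theta}
    return abs(transfer(psi, lam)) ** 2 * sigma2 / (2.0 * math.pi)


def density_via_autocov(lam, theta, sigma2):
    """(1/2pi) sum_h gamma_X(h) e^{-i h lam} with the MA(1) auto-covariances."""
    gamma = {-1: theta * sigma2, 0: (1.0 + theta ** 2) * sigma2, 1: theta * sigma2}
    total = sum(g * cmath.exp(-1j * h * lam) for h, g in gamma.items())
    return total.real / (2.0 * math.pi)


for theta in (0.5, -0.5):
    low = density_via_filter(0.0, theta, 1.0)
    high = density_via_filter(math.pi, theta, 1.0)
    print(f"theta={theta}: f_X(0)={low:.4f}, f_X(pi)={high:.4f}")
```

For $\theta > 0$ the density is larger at frequency 0 than at $\pi$, and the order is reversed for $\theta < 0$, which is the dominance of low or high frequencies described in the example.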

Figure 6.1. Spectral density of the moving average $X_t = Z_t + 0.5 Z_{t-1}$. (Vertical scale in decibels.)



6.11 Example. The process $X_t = Ae^{i\lambda t}$ for a mean zero variable $A$ and $\lambda \in (-\pi, \pi]$
has covariance function

$$\gamma_X(h) = \operatorname{cov}(Ae^{i\lambda(t+h)}, Ae^{i\lambda t}) = e^{ih\lambda}\, \mathrm{E}|A|^2.$$

The corresponding spectral measure is the 1-point measure $F_X$ with $F_X\{\lambda\} = \mathrm{E}|A|^2$.
Therefore, the filtered series $Y_t = \sum_j \psi_j X_{t-j}$ has spectral measure the 1-point measure
with $F_Y\{\lambda\} = |\psi(\lambda)|^2\, \mathrm{E}|A|^2$. By direct calculation we find that

$$Y_t = \sum_j \psi_j A e^{i\lambda(t-j)} = A e^{i\lambda t} \psi(\lambda) = \psi(\lambda) X_t.$$

This suggests an interpretation for the term "transfer function". Filtering a "pure signal"
$Ae^{i\lambda t}$ of a single frequency apparently yields another signal of the same single frequency,
but the amplitude of the signal changes by multiplication with the factor $\psi(\lambda)$. If $\psi(\lambda) =
0$, then the frequency is "not transmitted", whereas values of $|\psi(\lambda)|$ bigger or smaller
than 1 mean that the frequency $\lambda$ is amplified or weakened. □
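The identity $Y_t = \psi(\lambda) X_t$ is easy to verify for a concrete finite filter. In the sketch below the filter coefficients and the realisation of the amplitude $A$ are hypothetical values chosen only for the illustration, and the function names are ours.

```python
import cmath


def transfer(psi, lam):
    """Transfer function sum_j psi_j e^{-i j lam} of a finite filter {j: psi_j}."""
    return sum(c * cmath.exp(-1j * j * lam) for j, c in psi.items())


def filtered_pure_signal(psi, amplitude, lam, t):
    """Y_t = sum_j psi_j X_{t-j} for the pure signal X_t = amplitude * e^{i lam t}."""
    return sum(c * amplitude * cmath.exp(1j * lam * (t - j)) for j, c in psi.items())


psi = {-1: 0.25, 0: 0.5, 1: 0.25}  # hypothetical smoothing filter
amplitude, lam = 2.0 - 1.0j, 1.1   # one fixed realisation of A, chosen arbitrarily
for t in range(3):
    x_t = amplitude * cmath.exp(1j * lam * t)
    print(t, filtered_pure_signal(psi, amplitude, lam, t), transfer(psi, lam) * x_t)
print("gain |psi(lam)|:", abs(transfer(psi, lam)))
```

For this particular filter the transfer function vanishes at $\lambda = \pi$, so that the highest frequency is "not transmitted".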

6.12 EXERCISE. Find the spectral measure of $X_t = Ae^{i\lambda t}$ for $\lambda$ not necessarily belonging
to $(-\pi, \pi]$.

To give a further interpretation to the spectral measure consider a band pass filter.
This is a filter with transfer function of the form

$$\psi(\lambda) = \begin{cases} 0, & \text{if } |\lambda - \lambda_0| > \delta, \\ 1, & \text{if } |\lambda - \lambda_0| \le \delta, \end{cases}$$

for a fixed frequency $\lambda_0$ and fixed band width $2\delta$. According to Example 6.11 this filter
"kills" all the signals $Ae^{i\lambda t}$ of frequencies $\lambda$ outside the interval $[\lambda_0 - \delta, \lambda_0 + \delta]$ and
transmits all signals $Ae^{i\lambda t}$ for $\lambda$ inside this range unchanged. The spectral density of the
filtered signal $Y_t = \sum_j \psi_j X_{t-j}$ relates to the spectral density of the original signal $X_t$
(if there exists one) as

$$f_Y(\lambda) = |\psi(\lambda)|^2 f_X(\lambda) = \begin{cases} 0, & \text{if } |\lambda - \lambda_0| > \delta, \\ f_X(\lambda), & \text{if } |\lambda - \lambda_0| \le \delta. \end{cases}$$

Now think of $X_t$ as a signal composed of many frequencies. The band pass filter transmits
only the subsignals of frequencies in the interval $[\lambda_0 - \delta, \lambda_0 + \delta]$. This explains that the
spectral density of the filtered sequence $Y_t$ vanishes outside this interval. For small $\delta > 0$,

$$\operatorname{var} Y_t = \gamma_Y(0) = \int_{-\pi}^{\pi} f_Y(\lambda)\, d\lambda = \int_{\lambda_0-\delta}^{\lambda_0+\delta} f_X(\lambda)\, d\lambda \approx 2\delta f_X(\lambda_0).$$

We interpret this as saying that $f_X(\lambda_0)$ is proportional to the variance of the subsignals
in $X_t$ of frequency $\lambda_0$. The total variance $\operatorname{var} X_t = \gamma_X(0) = \int_{-\pi}^{\pi} f_X(\lambda)\, d\lambda$ in the signal
$X_t$ is the total area under the spectral density. This can be viewed as the sum of the
variances of the subsignals of frequencies $\lambda$, the area under $f_X$ between $\lambda_0 - \delta$ and $\lambda_0 + \delta$
being the variance of the subsignals of frequencies in this interval.
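The approximation $\operatorname{var} Y_t \approx 2\delta f_X(\lambda_0)$ can be illustrated with a numerical integral. The sketch assumes, purely as an example, the moving average spectral density of Example 6.10 with $\theta = 0.5$ and $\sigma^2 = 1$, and uses a plain midpoint rule; the band centre $\lambda_0 = 1$ and half width $\delta = 0.01$ are arbitrary choices.

```python
import math


def f_x(lam, theta=0.5, sigma2=1.0):
    """Spectral density of the MA(1) series of Example 6.10 (illustrative choice)."""
    return (1.0 + 2.0 * theta * math.cos(lam) + theta ** 2) * sigma2 / (2.0 * math.pi)


def integrate(func, a, b, grid=10000):
    """Plain midpoint rule on [a, b]."""
    step = (b - a) / grid
    return step * sum(func(a + (k + 0.5) * step) for k in range(grid))


lam0, delta = 1.0, 0.01
band_variance = integrate(f_x, lam0 - delta, lam0 + delta)
print("variance passed by the band:", band_variance)
print("2 * delta * f_X(lam0):      ", 2.0 * delta * f_x(lam0))
print("total variance:", integrate(f_x, -math.pi, math.pi))
```

The total variance returned by the last line is the area under the whole density, here $(1 + \theta^2)\sigma^2$.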

A band pass filter is a theoretical filter: in practice it is not possible to filter out 
an exact range of frequencies. Only smooth transfer functions can be implemented on a 
computer, and only the ones corresponding to finite filters (the ones with only finitely 
many nonzero filter coefficients $\psi_j$).

The filter coefficients $\psi_j$ relate to the transfer function $\psi(\lambda)$ in the same way as
the auto-covariances $\gamma_X(h)$ relate to the spectral density $f_X(\lambda)$, apart from a factor $2\pi$.
Thus, to find the filter coefficients of a given transfer function $\psi$, it suffices to apply the
Fourier inversion formula

$$\psi_j = \frac{1}{2\pi} \int_{-\pi}^{\pi} e^{ij\lambda} \psi(\lambda)\, d\lambda.$$

6.13 EXERCISE. Find the filter coefficients of a band pass filter. 
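The Fourier inversion formula for filter coefficients can be tried out on a filter whose coefficients are known. The sketch below uses a hypothetical finite filter (not the band pass filter of the exercise) and approximates the integral by a midpoint rule; for a trigonometric polynomial this rule is exact up to rounding. The function names are ours.

```python
import cmath
import math


def transfer(psi, lam):
    """Transfer function sum_j psi_j e^{-i j lam} of a finite filter {j: psi_j}."""
    return sum(c * cmath.exp(-1j * j * lam) for j, c in psi.items())


def coefficient(transfer_fn, j, grid=1024):
    """Approximate (1/2pi) * integral over (-pi, pi] of e^{i j lam} psi(lam) d lam."""
    step = 2.0 * math.pi / grid
    total = sum(
        cmath.exp(1j * j * lam) * transfer_fn(lam)
        for lam in (-math.pi + (k + 0.5) * step for k in range(grid))
    )
    return total * step / (2.0 * math.pi)


psi = {0: 1.0, 1: -0.6, 3: 0.2}  # hypothetical finite filter
recovered = {j: coefficient(lambda lam: transfer(psi, lam), j) for j in range(-1, 5)}
for j, value in recovered.items():
    print(j, round(value.real, 10), round(value.imag, 10))
```

The recovered values agree with the coefficients that were put in, and are zero at the lags where the filter has no coefficient.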

6.14 Example (Low frequency and trend). An apparent trend in observed data
$X_1, \ldots, X_n$ could be modelled as a real trend in a nonstationary time series, but could
alternatively be viewed as the beginning of a long cycle. In practice, where we get to
see only a finite stretch of a time series, low frequency cycles and slowly moving trends
cannot be discriminated. It was seen in Chapter 1 that differencing $Y_t = X_t - X_{t-1}$ of
a time series $X_t$ removes a linear trend, and repeated differencing removes higher order
polynomial trends. In view of the preceding observation the differencing filter should
remove, to a certain extent, low frequencies.

The differencing filter has transfer function

$$\psi(\lambda) = 1 - e^{-i\lambda} = 2i e^{-i\lambda/2} \sin\frac{\lambda}{2}.$$

The absolute value $|\psi(\lambda)|$ of this transfer function increases from 0 at 0 to its maximum
value at $\pi$. Thus, indeed, it filters away low frequencies, albeit only with partial success.
□
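A short numerical check of the transfer function of the differencing filter: the sketch compares $1 - e^{-i\lambda}$ with the factorised form and lists the gain $|\psi(\lambda)| = 2\sin(\lambda/2)$ on a grid in $[0, \pi]$. The grid and the function names are our own choices.

```python
import cmath
import math


def differencing_transfer(lam):
    """Transfer function 1 - e^{-i lam} of Y_t = X_t - X_{t-1}."""
    return 1.0 - cmath.exp(-1j * lam)


def closed_form(lam):
    """The factorisation 2 i e^{-i lam / 2} sin(lam / 2)."""
    return 2j * cmath.exp(-0.5j * lam) * math.sin(lam / 2.0)


grid = [k * math.pi / 8 for k in range(9)]  # 0, pi/8, ..., pi
gains = [abs(differencing_transfer(lam)) for lam in grid]
for lam, gain in zip(grid, gains):
    print(f"lam={lam:.3f}  |psi(lam)|={gain:.4f}")
```

The gain is zero at frequency 0, increases with the frequency, and reaches its maximum 2 at $\pi$.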




Figure 6.2. Absolute value of the transfer function of the difference filter.

6.15 Example (Averaging). The averaging filter $Y_t = (2M+1)^{-1} \sum_{j=-M}^{M} X_{t-j}$ has
transfer function

$$\psi(\lambda) = \frac{1}{2M+1} \sum_{j=-M}^{M} e^{-ij\lambda} = \frac{\sin((M + \tfrac{1}{2})\lambda)}{(2M+1)\sin(\tfrac{1}{2}\lambda)}.$$

This function is proportional to the Dirichlet kernel, which is the function obtained by
replacing the factor $2M+1$ by $2\pi$. From a picture of this kernel we conclude that averaging
removes high frequencies to a certain extent (and in an uneven manner depending on
$M$), but retains low frequencies. □
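The closed form of the transfer function of the averaging filter can be compared with the defining sum. The following sketch uses $M = 10$, the order shown in the figure; the function names and the selection of frequencies are ours.

```python
import cmath
import math


def averaging_transfer(lam, m):
    """(2M+1)^{-1} sum_{j=-M}^{M} e^{-i j lam}."""
    return sum(cmath.exp(-1j * j * lam) for j in range(-m, m + 1)) / (2 * m + 1)


def closed_form(lam, m):
    """sin((M + 1/2) lam) / ((2M + 1) sin(lam / 2)), read as 1 at lam = 0."""
    if lam == 0.0:
        return 1.0
    return math.sin((m + 0.5) * lam) / ((2 * m + 1) * math.sin(0.5 * lam))


m = 10
for lam in (0.0, 0.1, 0.5, 1.0, 2.0, math.pi):
    print(f"lam={lam:.3f}  psi(lam)={averaging_transfer(lam, m).real:+.4f}")
```

The value 1 at frequency 0 means that a constant level passes the filter unchanged, whereas the values at higher frequencies are small and oscillate in sign.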

Figure 6.3. Dirichlet kernel of order $M = 10$.



6.16 EXERCISE. Express the variance of $Y_t$ in the preceding example in $\psi$ and the
spectral density of the time series $X_t$ (assuming that there is one). What happens if
$M \to \infty$? Which conclusion can you draw? Does this remain true if the series $X_t$ does
not have a spectral density?



6.17 EXERCISE. Find the transfer function of the filter $Y_t = X_t - X_{t-12}$. Interpret
the result.

Instead of in terms of frequencies we can also think in terms of periods. A series
of the form $t \mapsto e^{i\lambda t}$ repeats itself after $2\pi/\lambda$ instants of time. Therefore, the period is
defined as

$$\text{period} = \frac{2\pi}{\text{frequency}}.$$

Most monthly time series (one observation per month) have a period effect of 12 months.
If so, this will be visible as a peak in the spectrum at the frequency $2\pi/12 = \pi/6$.† Often
the 12-month cycle is not completely regular. This may produce additional (but smaller)
peaks at the harmonic frequencies $2\pi/6, 3\pi/6, \ldots$, or $\pi/12, \pi/18, \ldots$.

It is surprising at first that the highest possible frequency is $\pi$, the so-called Nyquist
frequency. This is caused by the fact that the series is measured only at discrete time
points. Very high fluctuations fall completely between the measurements and hence cannot
be observed. The Nyquist frequency $\pi$ corresponds to a period of $2\pi/\pi = 2$ time
instants and this is clearly the smallest period that is observable. For time series that
are observed in continuous time the spectrum is defined to contain all frequencies in $\mathbb{R}$.
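The relation between period and frequency, and the fact that fluctuations faster than the Nyquist frequency cannot be seen at integer time points, can be illustrated as follows. The helper `fold` is our own name for the map that sends a frequency to the frequency in $(-\pi, \pi]$ that differs from it by a multiple of $2\pi$; the value $1.75\pi$ is an arbitrary example of a frequency above $\pi$.

```python
import math


def period(frequency):
    """Period 2 pi / frequency of the signal t -> e^{i frequency t}."""
    return 2.0 * math.pi / frequency


def fold(lam):
    """Frequency in (-pi, pi] that agrees with lam at all integer times."""
    return lam - 2.0 * math.pi * math.ceil((lam - math.pi) / (2.0 * math.pi))


monthly = 2.0 * math.pi / 12.0
print("12-month cycle: frequency", monthly, "period", period(monthly))
print("Nyquist frequency pi: period", period(math.pi))

fast = 1.75 * math.pi  # above the Nyquist frequency
alias = fold(fast)     # equals -0.25 * pi up to rounding
samples_fast = [math.cos(fast * t) for t in range(8)]
samples_alias = [math.cos(alias * t) for t in range(8)]
print("largest difference at integer times:",
      max(abs(a - b) for a, b in zip(samples_fast, samples_alias)))
```

At the integer time points the fast cosine and its folded version coincide up to rounding, so that the two cannot be told apart from the observations.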



* 6.2 Nonsummable filters 

If given filter coefficients $\psi_j$ satisfy $\sum_j |\psi_j| < \infty$, then the series $\psi(\lambda) = \sum_j \psi_j e^{-ij\lambda}$ converges
uniformly on $(-\pi, \pi]$, and the coefficients can be recovered from the transfer function
$\psi$ by the Fourier inversion formula $\psi_j = (2\pi)^{-1} \int_{-\pi}^{\pi} e^{ij\lambda} \psi(\lambda)\, d\lambda$. (See Problem 6.1.)
Unfortunately, not all filters have summable coefficients. An example is the band pass
filter considered previously. In fact, if a sequence of filter coefficients is summable, then
the corresponding transfer function must be continuous, and $\lambda \mapsto \psi(\lambda) = 1_{[\lambda_0-\delta,\lambda_0+\delta]}(\lambda)$
is not. Nevertheless, the series $\sum_j \psi_j e^{-ij\lambda}$ is well-defined for the band pass filter and has
the function $1_{[\lambda_0-\delta,\lambda_0+\delta]}(\lambda)$ as its limit in a certain sense. To handle examples such as
this it is worthwhile to generalize Theorem 6.9 (and Lemma 1.27) a little.

6.18 Theorem. Let $X_t$ be a stationary time series with spectral measure $F_X$, defined on
the probability space $(\Omega, \mathcal{U}, \mathrm{P})$. Then the series $\psi(\lambda) = \sum_j \psi_j e^{-i\lambda j}$ converges in $L_2(F_X)$
if and only if $Y_t = \sum_j \psi_j X_{t-j}$ converges in $L_2(\Omega, \mathcal{U}, \mathrm{P})$ for some $t$ (and then for every
$t \in \mathbb{Z}$) and in that case

$$dF_Y(\lambda) = |\psi(\lambda)|^2\, dF_X(\lambda).$$

Proof. For $0 \le m \le n$ let $\psi_j^{m,n}$ be equal to $\psi_j$ for $m \le |j| \le n$ and be 0 otherwise, and
define $Y_t^{m,n}$ as the series $X_t$ filtered by the coefficients $\psi_j^{m,n}$. Then certainly $\sum_j |\psi_j^{m,n}| <
\infty$ for every fixed pair $(m, n)$ and hence we can apply Lemma 1.27 and Theorem 6.9 to
the series $Y_t^{m,n}$. This yields

$$\mathrm{E}\Bigl|\sum_{m \le |j| \le n} \psi_j X_{t-j}\Bigr|^2 = \mathrm{E}|Y_t^{m,n}|^2 = \gamma_{Y^{m,n}}(0) = \int_{(-\pi,\pi]} dF_{Y^{m,n}} = \int_{(-\pi,\pi]} \Bigl|\sum_{m \le |j| \le n} \psi_j e^{-i\lambda j}\Bigr|^2 dF_X(\lambda).$$

The left side converges to zero for $m, n \to \infty$ if and only if the partial sums of the series
$Y_t = \sum_j \psi_j X_{t-j}$ form a Cauchy sequence in $L_2(\Omega, \mathcal{U}, \mathrm{P})$; the right side converges to zero
if and only if the partial sums of the sequence $\sum_j \psi_j e^{-i\lambda j}$ form a Cauchy sequence in
$L_2(F_X)$. The first assertion of the theorem now follows, because both spaces are complete.
To prove the second assertion, we first note that, by Theorem 6.9,

$$\operatorname{cov}\Bigl(\sum_{|j| \le n} \psi_j X_{t+h-j}, \sum_{|j| \le n} \psi_j X_{t-j}\Bigr) = \gamma_{Y^{0,n}}(h) = \int_{(-\pi,\pi]} \Bigl|\sum_{|j| \le n} \psi_j e^{-i\lambda j}\Bigr|^2 e^{ih\lambda}\, dF_X(\lambda).$$

We now take limits of the left and right sides as $n \to \infty$ to find that $\gamma_Y(h) =
\int_{(-\pi,\pi]} |\psi(\lambda)|^2 e^{ih\lambda}\, dF_X(\lambda)$, for every $h$. ■

† That the frequency $\pi/6$ of a 12-month period is a complicated number is an inconvenient
consequence of our convention to define the spectrum on the interval $(-\pi, \pi]$. This can be
repaired. For instance, the Splus package produces spectral plots with the frequencies rescaled
to the interval $(-\tfrac{1}{2}, \tfrac{1}{2}]$. Then a 12-month period gives a peak at 1/12.

6.19 Example. If the filter coefficients satisfy $\sum_j |\psi_j|^2 < \infty$ (which is weaker than
absolute convergence), then the series $\sum_j \psi_j e^{-ij\lambda}$ converges in $L_2(\lambda)$ for $\lambda$ the Lebesgue
measure on $(-\pi, \pi]$. This is a central fact in Fourier theory and follows from

$$\int_{-\pi}^{\pi} \Bigl|\sum_{m \le |j| \le n} \psi_j e^{-ij\lambda}\Bigr|^2 d\lambda = \sum_{m \le |j| \le n} \sum_{m \le |k| \le n} \psi_j \overline{\psi_k} \int_{-\pi}^{\pi} e^{i(k-j)\lambda}\, d\lambda = 2\pi \sum_{m \le |j| \le n} |\psi_j|^2.$$

Consequently, the series also converges in $L_2(F_X)$ for every spectral measure $F_X$ that
possesses a bounded density.

Thus, in many cases a sequence of square-summable coefficients defines a valid filter.
A particular example is a band pass filter, for which $|\psi_j| = O(1/|j|)$ as $j \to \pm\infty$. □



* 6.3 Spectral Decomposition 

In the preceding section we interpreted the mass $F_X(I)$ that the spectral distribution
gives to an interval $I$ as the size of the contribution of the components of frequencies
$\lambda \in I$ to the signal $t \mapsto X_t$. In this section we give a precise mathematical meaning to
this idea. We show that a given stationary time series $X_t$ can be written as a randomly
weighted sum of single frequency signals $e^{i\lambda t}$.

This decomposition is simple in the case of a discrete spectral measure. For given
uncorrelated mean zero random variables $Z_1, \ldots, Z_k$ and numbers $\lambda_1, \ldots, \lambda_k \in (-\pi, \pi]$
the process

$$X_t = \sum_{j=1}^k Z_j e^{i\lambda_j t}$$

possesses as spectral measure $F_X$ the discrete measure with point masses of sizes
$F_X\{\lambda_j\} = \mathrm{E}|Z_j|^2$ at the frequencies $\lambda_1, \ldots, \lambda_k$ (and no other mass). The series $X_t$ is
the sum of uncorrelated, single-frequency signals of stochastic amplitudes $|Z_j|$. This is
called the spectral decomposition of the series $X_t$. We prove below that this construction
can be reversed: given a mean zero stationary sequence $X_t$ with discrete spectral measure
as given, there exist mean zero uncorrelated random variables $Z_1, \ldots, Z_k$ with variances
$F_X\{\lambda_j\}$ such that the decomposition is valid.
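The claim about the spectral measure of a finite sum of single-frequency signals can be verified by an exact computation in a special case. In the sketch below we assume, only for the illustration, that $Z_j = a_j \varepsilon_j$ for fixed numbers $a_j$ and independent random signs $\varepsilon_j$, so that the $Z_j$ are mean zero and uncorrelated with $\mathrm{E}|Z_j|^2 = a_j^2$. The expectation is computed by enumerating all sign patterns; the frequencies and amplitudes are arbitrary, and the function names are ours.

```python
import cmath
from itertools import product


def autocov_exact(h, t, amplitudes, freqs):
    """Exact E[X_{t+h} * conj(X_t)] for X_t = sum_j Z_j e^{i lam_j t}.

    Assumption: Z_j = amplitudes[j] * eps_j with independent random signs eps_j.
    """
    k = len(amplitudes)
    total = 0j
    for signs in product((-1.0, 1.0), repeat=k):
        x_later = sum(s * a * cmath.exp(1j * lam * (t + h))
                      for s, a, lam in zip(signs, amplitudes, freqs))
        x_now = sum(s * a * cmath.exp(1j * lam * t)
                    for s, a, lam in zip(signs, amplitudes, freqs))
        total += x_later * x_now.conjugate()
    return total / 2 ** k  # every sign pattern has probability 2^{-k}


def autocov_spectral(h, amplitudes, freqs):
    """Integral of e^{i h lam} against the point masses F_X{lam_j} = E|Z_j|^2."""
    return sum(a ** 2 * cmath.exp(1j * h * lam) for a, lam in zip(amplitudes, freqs))


amplitudes = [1.0, 0.5, 2.0]
freqs = [-2.0, 0.3, 1.2]
for h in range(4):
    print(h, autocov_exact(h, 7, amplitudes, freqs), autocov_spectral(h, amplitudes, freqs))
```

The two columns agree and do not depend on the time point $t$, as they should for a stationary series with the stated discrete spectral measure.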

This justifies the interpretation of the spectrum given in the preceding section. The
possibility of the decomposition is surprising in that the spectral measure only involves
the auto-covariance function of a time series, whereas the spectral decomposition is a
decomposition of the sample paths of the time series: if the series $X_t$ is defined on a
given probability space $(\Omega, \mathcal{U}, \mathrm{P})$, then so are the random variables $Z_j$ and the preceding
spectral decomposition may be understood as being valid for every $\omega \in \Omega$, almost surely.
This can be true, of course, only if the variables $Z_1, \ldots, Z_k$ also have other properties
besides the ones described. The spectral theorem below does not give any information
about these further properties. For instance, even though uncorrelated, the $Z_j$ need not
be independent. This restricts the usefulness of the spectral decomposition, but we could
not expect more. The spectrum only involves the second moment properties of the time
series, and thus leaves most of the distribution of the series undescribed. An important
exception to this rule is if the series $X_t$ is Gaussian. Then the first and second moments,
and hence the mean and the spectral distribution, completely describe the distribution
of the series $X_t$.

The spectral decomposition is not restricted to time series with discrete spectral
measures. However, in general, the spectral decomposition involves a continuum of frequencies
and the sum becomes an integral

$$X_t = \int_{(-\pi,\pi]} e^{i\lambda t}\, dZ(\lambda).$$



A technical complication is that such an integral, relative to a "random measure" Z, is 
not defined in ordinary measure theory. We must first give it a meaning. 

6.20 Definition. A random measure with orthogonal increments $Z$ is a collection
$\{Z(B): B \in \mathcal{B}\}$ of mean zero, complex random variables $Z(B)$ indexed by the Borel
sets $B$ in $(-\pi, \pi]$ defined on some probability space $(\Omega, \mathcal{U}, \mathrm{P})$ such that, for some finite
Borel measure $\mu$ on $(-\pi, \pi]$,

$$\operatorname{cov}(Z(B_1), Z(B_2)) = \mu(B_1 \cap B_2), \qquad \text{every } B_1, B_2 \in \mathcal{B}.$$

This definition does not appear to include a basic requirement of a measure: that the
measure of a countable union of disjoint sets is the sum of the measures of the individual
sets. However, we leave it as an exercise to show that this is implied by the covariance
property.

6.21 EXERCISE. Let $Z$ be a random measure with orthogonal increments. Show that
$Z(\cup_j B_j) = \sum_j Z(B_j)$ in mean square, whenever $B_1, B_2, \ldots$ is a sequence of pairwise
disjoint Borel sets.

6.22 EXERCISE. Let $Z$ be a random measure with orthogonal increments and define
$Z_\lambda = Z(-\pi, \lambda]$. Show that $(Z_\lambda: \lambda \in (-\pi, \pi])$ is a stochastic process with uncorrelated
increments: for $\lambda_1 < \lambda_2 < \lambda_3 < \lambda_4$ the variables $Z_{\lambda_4} - Z_{\lambda_3}$ and $Z_{\lambda_2} - Z_{\lambda_1}$ are uncorrelated.
This explains the phrase "with orthogonal increments".

* 6.23 EXERCISE. Suppose that $Z_\lambda$ is a mean zero stochastic process with finite second
moments and uncorrelated increments. Show that this process corresponds to a random
measure with orthogonal increments as in the preceding exercise. [This asks you to
reconstruct the random measure $Z$ from the weights $Z_\lambda = Z(-\pi, \lambda]$ it gives to cells,
similarly as an ordinary measure can be reconstructed from its distribution function.]

Next we define an "integral" $\int f\, dZ$ for given functions $f: (-\pi, \pi] \to \mathbb{C}$. For an
indicator function $f = 1_B$ of a Borel set $B$, we define, in analogy with an ordinary
integral, $\int 1_B\, dZ = Z(B)$. Because we wish the integral to be linear, we are led to the
definition

$$\int \sum_j \alpha_j 1_{B_j}\, dZ = \sum_j \alpha_j Z(B_j),$$

for every finite collection of complex numbers $\alpha_j$ and Borel sets $B_j$. This determines
the integral for many, but not all functions $f$. We extend its domain by continuity:
we require that $\int f_n\, dZ \to \int f\, dZ$ whenever $f_n \to f$ in $L_2(\mu)$. The following lemma
shows that these definitions and requirements can be consistently made, and serves as a
definition of $\int f\, dZ$.

6.24 Lemma. For every random measure with orthogonal increments $Z$ there exists a
unique map, denoted $f \mapsto \int f\, dZ$, from $L_2(\mu)$ into $L_2(\Omega, \mathcal{U}, \mathrm{P})$ with the properties
(i) $\int 1_B\, dZ = Z(B)$;
(ii) $\int (\alpha f + \beta g)\, dZ = \alpha \int f\, dZ + \beta \int g\, dZ$;
(iii) $\mathrm{E}|\int f\, dZ|^2 = \int |f|^2\, d\mu$.
In other words, the map $f \mapsto \int f\, dZ$ is a linear isometry such that $1_B \mapsto Z(B)$.

Proof. By the defining property of $Z$, for any complex numbers $\alpha_i$ and Borel sets $B_i$,

$$\mathrm{E}\Bigl|\sum_{i=1}^k \alpha_i Z(B_i)\Bigr|^2 = \sum_i \sum_j \alpha_i \overline{\alpha_j} \operatorname{cov}(Z(B_i), Z(B_j)) = \int \Bigl|\sum_{i=1}^k \alpha_i 1_{B_i}\Bigr|^2 d\mu.$$

For $f$ a simple function of the form $f = \sum_i \alpha_i 1_{B_i}$, we define $\int f\, dZ$ as $\sum_i \alpha_i Z(B_i)$.
This is well-defined, for, if $f$ also has the representation $f = \sum_j \beta_j 1_{D_j}$, then
$\sum_i \alpha_i Z(B_i) = \sum_j \beta_j Z(D_j)$ almost surely. This follows by applying the preceding identity
to $\sum_i \alpha_i Z(B_i) - \sum_j \beta_j Z(D_j)$.

The "integral" $\int f\, dZ$ that is now defined on the domain of all simple functions $f$
trivially satisfies (i) and (ii), while (iii) is exactly the identity in the preceding display.
The proof is complete upon showing that the map $f \mapsto \int f\, dZ$ can be extended from
the domain of simple functions to the domain $L_2(\mu)$, meanwhile retaining the properties
(i)-(iii).

We extend the map by continuity. For every $f \in L_2(\mu)$ there exists a sequence of
simple functions $f_n$ such that $\int |f_n - f|^2\, d\mu \to 0$. We define $\int f\, dZ$ as the limit of the
sequence $\int f_n\, dZ$. This is well-defined. First, the limit exists, because, by the linearity of
the integral and the identity,

$$\mathrm{E}\Bigl|\int f_n\, dZ - \int f_m\, dZ\Bigr|^2 = \int |f_n - f_m|^2\, d\mu,$$

since $f_n - f_m$ is a simple function. Because $f_n$ is a Cauchy sequence in $L_2(\mu)$, the right
side converges to zero as $m, n \to \infty$. We conclude that $\int f_n\, dZ$ is a Cauchy sequence in
$L_2(\Omega, \mathcal{U}, \mathrm{P})$ and hence it has a limit by completeness. Second, the definition of $\int f\, dZ$
does not depend on the particular sequence $f_n \to f$ we use. This follows, because given
another sequence of simple functions $g_n \to f$, we have $\int |f_n - g_n|^2\, d\mu \to 0$ and hence
$\mathrm{E}|\int f_n\, dZ - \int g_n\, dZ|^2 \to 0$.

We conclude the proof by noting that properties (i)-(iii) are retained under taking
limits. ■

6.25 EXERCISE. Show that a linear isometry $\Phi: H_1 \to H_2$ between two Hilbert spaces
$H_1$ and $H_2$ retains inner products, i.e. $\langle \Phi(f_1), \Phi(f_2) \rangle_2 = \langle f_1, f_2 \rangle_1$. Conclude that
$\operatorname{cov}(\int f\, dZ, \int g\, dZ) = \int f \overline{g}\, d\mu$.

We are now ready to derive the spectral decomposition for a general stationary time
series $X_t$. Let $L_2(X_t: t \in \mathbb{Z})$ be the closed, linear span of the elements of the time series
in $L_2(\Omega, \mathcal{U}, \mathrm{P})$ (i.e. the closure of the linear span of the set $\{X_t: t \in \mathbb{Z}\}$).

6.26 Theorem. For any mean zero stationary time series $X_t$ with spectral distribution
$F_X$ there exists a random measure $Z$ with orthogonal increments relative to the measure
$F_X$ such that $\{Z(B): B \in \mathcal{B}\} \subset L_2(X_t: t \in \mathbb{Z})$ and such that $X_t = \int e^{i\lambda t}\, dZ(\lambda)$ almost
surely for every $t \in \mathbb{Z}$.

Proof. By the definition of the spectral measure $F_X$ we have, for every finite collection
of complex numbers $\alpha_j$ and integers $t_j$,

$$\mathrm{E}\Bigl|\sum_j \alpha_j X_{t_j}\Bigr|^2 = \sum_i \sum_j \alpha_i \overline{\alpha_j}\, \gamma_X(t_i - t_j) = \int \Bigl|\sum_j \alpha_j e^{it_j\lambda}\Bigr|^2 dF_X(\lambda).$$

Now define a map $\Phi: L_2(F_X) \to L_2(X_t: t \in \mathbb{Z})$ as follows. For $f$ of the form $f =
\sum_j \alpha_j e^{it_j\lambda}$ define $\Phi(f) = \sum_j \alpha_j X_{t_j}$. By the preceding identity this is well-defined.
(Check!) Furthermore, $\Phi$ is a linear isometry. By the same arguments as in the preceding
lemma, it can be extended to a linear isometry on the closure of the space of
all functions $\sum_j \alpha_j e^{it_j\lambda}$. By Fejér's theorem from Fourier theory, this closure contains
at least all Lipschitz periodic functions. By measure theory this collection is dense in
$L_2(F_X)$. Thus the closure is all of $L_2(F_X)$. In particular, it contains all indicator functions
$1_B$ of Borel sets $B$. Define $Z(B) = \Phi(1_B)$. Because $\Phi$ is a linear isometry, it retains
inner products and hence

$$\operatorname{cov}(Z(B_1), Z(B_2)) = \langle \Phi(1_{B_1}), \Phi(1_{B_2}) \rangle = \int 1_{B_1} 1_{B_2}\, dF_X.$$

This shows that $Z$ is a random measure with orthogonal increments. By definition

$$\int \sum_j \alpha_j 1_{B_j}\, dZ = \sum_j \alpha_j Z(B_j) = \sum_j \alpha_j \Phi(1_{B_j}) = \Phi\Bigl(\sum_j \alpha_j 1_{B_j}\Bigr).$$

Thus $\int f\, dZ = \Phi(f)$ for every simple function $f$. Both sides of this identity are linear
isometries when seen as functions of $f \in L_2(F_X)$. Hence the identity extends to all $f \in
L_2(F_X)$. In particular, we obtain $\int e^{it\lambda}\, dZ(\lambda) = \Phi(e^{it\lambda}) = X_t$ on choosing $f(\lambda) = e^{it\lambda}$. ■

Thus we have managed to give a precise mathematical formulation to the spectral
decomposition

$$X_t = \int_{(-\pi,\pi]} e^{it\lambda}\, dZ(\lambda)$$

of a mean zero stationary time series $X_t$. The definition may seem a bit involved. An
insightful interpretation is obtained by approximation through Riemann sums. Given
a partition $-\pi = \lambda_{0,k} < \lambda_{1,k} < \cdots < \lambda_{k,k} = \pi$ consider the function $\lambda \mapsto f_k(\lambda)$
that is piecewise constant, and takes the value $e^{it\lambda_{j,k}}$ on the interval $(\lambda_{j-1,k}, \lambda_{j,k}]$. If the
partitions are chosen such that the mesh width of the partitions converges to zero as
$k \to \infty$, then $|f_k(\lambda) - e^{it\lambda}|$ converges to zero, uniformly in $\lambda \in (-\pi, \pi]$, by the uniform
continuity of the function $\lambda \mapsto e^{it\lambda}$, and hence $f_k(\lambda) \to e^{it\lambda}$ in $L_2(F_X)$. Because the
stochastic integral $f \mapsto \int f\, dZ$ is linear, we have $\int f_k\, dZ = \sum_j e^{it\lambda_{j,k}} Z(\lambda_{j-1,k}, \lambda_{j,k}]$ and
because it is an isometry, we find

$$\mathrm{E}\Bigl|X_t - \sum_{j=1}^k e^{it\lambda_{j,k}} Z(\lambda_{j-1,k}, \lambda_{j,k}]\Bigr|^2 = \int |e^{it\lambda} - f_k(\lambda)|^2\, dF_X(\lambda) \to 0.$$

Because the intervals $(\lambda_{j-1,k}, \lambda_{j,k}]$ are pairwise disjoint, the random variables $Z_j :=
Z(\lambda_{j-1,k}, \lambda_{j,k}]$ are uncorrelated, by the defining property of an orthogonal random measure.
Thus the time series $X_t$ can be approximated by a time series of the form $\sum_j Z_j e^{i\lambda_j t}$,
as in the introduction of this section. The spectral measure $F_X(\lambda_{j-1,k}, \lambda_{j,k}]$ of the interval
$(\lambda_{j-1,k}, \lambda_{j,k}]$ is the variance of the random weight $Z_j$ in this decomposition.

6.27 Example. If the spectral measure $F_X$ is discrete with support points $\lambda_1, \ldots, \lambda_k$,
then the integral on the right in the preceding display (with $\lambda_{j,k} = \lambda_j$) is identically
zero. In that case $X_t = \sum_j Z_j e^{i\lambda_j t}$ almost surely for every $t$. □

6.28 Example. If the time series $X_t$ is Gaussian, then all variables in the linear span
of the $X_t$ are normally distributed (possibly degenerate) and hence all variables in
$L_2(X_t: t \in \mathbb{Z})$ are normally distributed. In that case the variables $Z(B)$ obtained from the
random measure $Z$ of the spectral decomposition of $X_t$ are jointly normally distributed.
The zero correlations of two variables $Z(B_1)$ and $Z(B_2)$ for disjoint sets $B_1$ and $B_2$ now
imply independence of these variables. □

Theorem 6.9 shows how a spectral measure changes under filtering. There is a corresponding
result for the spectral decomposition.

6.29 Theorem. Let $X_t$ be a mean zero, stationary time series with spectral measure
$F_X$ and associated random measure $Z_X$, defined on some probability space $(\Omega, \mathcal{U}, \mathrm{P})$. If
$\psi(\lambda) = \sum_j \psi_j e^{-i\lambda j}$ converges in $L_2(F_X)$, then $Y_t = \sum_j \psi_j X_{t-j}$ converges in $L_2(\Omega, \mathcal{U}, \mathrm{P})$
and has spectral measure $F_Y$ and associated random measure $Z_Y$ such that, for every
$f \in L_2(F_Y)$,

$$\int f\, dZ_Y = \int f \psi\, dZ_X.$$

Proof. The series $Y_t$ converges by Theorem 6.9, and the spectral measure $F_Y$ has density
$|\psi(\lambda)|^2$ relative to $F_X$. By definition,

$$\int e^{it\lambda}\, dZ_Y(\lambda) = Y_t = \sum_j \psi_j \int e^{i(t-j)\lambda}\, dZ_X(\lambda) = \int e^{it\lambda} \psi(\lambda)\, dZ_X(\lambda),$$

where in the last step changing the order of integration and summation is justified by the
convergence of the series $\sum_j \psi_j e^{i(t-j)\lambda}$ in $L_2(F_X)$ and the continuity of the stochastic
integral $f \mapsto \int f\, dZ_X$. We conclude that the identity of the theorem is satisfied for every
$f$ of the form $f(\lambda) = e^{it\lambda}$. Both sides of the identity are linear in $f$ and isometries on
the domain $f \in L_2(F_Y)$. Because the linear span of the functions $\lambda \mapsto e^{it\lambda}$ for $t \in \mathbb{Z}$ is
dense in $L_2(F_Y)$, the identity extends to all of $L_2(F_Y)$, by linearity and continuity. ■

6.30 Example (Law of large numbers). An interesting application of the spectral
decomposition is the following law of large numbers. If $X_t$ is a mean zero, stationary
time series with associated random measure $Z_X$, then $\overline{X}_n \to Z_X\{0\}$ in second mean (and hence
in probability) as $n \to \infty$. In particular, if $F_X\{0\} = 0$, then $\overline{X}_n \to 0$.

To see this, we write

$$\overline{X}_n = \frac{1}{n} \sum_{t=1}^n \int e^{it\lambda}\, dZ_X(\lambda) = \int \frac{e^{i\lambda}(1 - e^{i\lambda n})}{n(1 - e^{i\lambda})}\, dZ_X(\lambda).$$

Here the integrand must be read as 1 if $\lambda = 0$. For all other $\lambda \in (-\pi, \pi]$ the integrand
converges to zero as $n \to \infty$. It is bounded by 1 for every $\lambda$. Hence the integrand converges
in second mean to $1_{\{0\}}$ in $L_2(F_X)$ (and every other $L_2$-space). By the continuity of the
integral $f \mapsto \int f\, dZ_X$, we find that $\overline{X}_n$ converges in $L_2(\Omega, \mathcal{U}, \mathrm{P})$ to $\int 1_{\{0\}}\, dZ_X = Z_X\{0\}$.
□
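The behaviour of the integrand in this example is easy to inspect numerically. The sketch below computes the average $n^{-1} \sum_{t=1}^n e^{it\lambda}$ directly and through the geometric-sum formula; the frequency $\lambda = 0.3$ is an arbitrary nonzero value and the function names are ours.

```python
import cmath


def averaging_integrand(lam, n):
    """(1/n) sum_{t=1}^n e^{i t lam}, the integrand for the sample mean."""
    return sum(cmath.exp(1j * t * lam) for t in range(1, n + 1)) / n


def closed_form(lam, n):
    """e^{i lam} (1 - e^{i lam n}) / (n (1 - e^{i lam})), read as 1 at lam = 0."""
    if lam == 0.0:
        return 1.0 + 0.0j
    numerator = cmath.exp(1j * lam) * (1.0 - cmath.exp(1j * lam * n))
    return numerator / (n * (1.0 - cmath.exp(1j * lam)))


for n in (10, 100, 1000, 10000):
    print(n, abs(averaging_integrand(0.0, n)), abs(averaging_integrand(0.3, n)))
```

The absolute value stays equal to 1 at $\lambda = 0$ and decreases at rate $1/n$ at the nonzero frequency, in line with the pointwise limit $1_{\{0\}}$.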



6.4 Multivariate Spectra 

If spectral analysis of univariate time series is hard, spectral analysis of multivariate
time series is an art. It concerns not only "frequencies present in a single signal", but
also "dependencies between signals at given frequencies".

This difficulty concerns the interpretation only: the mathematical theory does not
pose new challenges. The covariance function $\gamma_X$ of a vector-valued time series $X_t$ is
matrix-valued. If the series $\sum_{h \in \mathbb{Z}} \|\gamma_X(h)\|$ is convergent, then the spectral density of the
series $X_t$ is defined by exactly the same formula as before:

$$f_X(\lambda) = \frac{1}{2\pi} \sum_{h \in \mathbb{Z}} \gamma_X(h) e^{-ih\lambda}.$$

The summation is now understood to be entry-wise, and hence $\lambda \mapsto f_X(\lambda)$ maps the
interval $(-\pi, \pi]$ into the set of $(d \times d)$-matrices, for $d$ the dimension of the series $X_t$.
Because the covariance function of the univariate series $a^T X_t$ is given by $\gamma_{a^T X} = a^T \gamma_X \overline{a}$,
it follows that, for every $a \in \mathbb{C}^d$,

$$a^T f_X(\lambda) \overline{a} = f_{a^T X}(\lambda).$$

In particular, the matrix $f_X(\lambda)$ is nonnegative-definite, for every $\lambda$. From the identity
$\overline{\gamma_X(-h)}^T = \gamma_X(h)$ it can also be ascertained that it is Hermitian. The diagonal elements
are nonnegative, but the off-diagonal elements of $f_X(\lambda)$ are complex-valued, in general.

As in the case of univariate time series, not every vector-valued time series possesses
a spectral density, but every such series does possess a spectral distribution. This "distribution"
is a matrix-valued, complex measure. A complex Borel measure on $(-\pi, \pi]$ is
a map $B \mapsto F(B)$ on the Borel sets that can be written as $F = F_1 - F_2 + i(F_3 - F_4)$ for
finite Borel measures $F_1, F_2, F_3, F_4$. If the complex part $F_3 - F_4$ is identically zero, then
$F$ is a signed measure. The spectral measure $F_X$ of a $d$-dimensional time series $X_t$ is a
$(d \times d)$ matrix whose $d^2$ entries are complex Borel measures on $(-\pi, \pi]$. The diagonal
elements are precisely the spectral measures of the coordinate time series and hence
are ordinary measures, but the off-diagonal measures are typically signed or complex
measures. The measure $F_X$ is Hermitian in the sense that $\overline{F_X(B)}^T = F_X(B)$ for every
Borel set $B$.

6.31 Theorem (Herglotz). For every stationary vector-valued time series $X_t$ there
exists a unique Hermitian matrix-valued complex measure $F_X$ on $(-\pi,\pi]$ such that

$$\gamma_X(h) = \int_{(-\pi,\pi]} e^{ih\lambda}\,dF_X(\lambda), \qquad h\in\mathbb{Z}.$$

Proof. For every $a\in\mathbb{C}^d$ the time series $a^TX_t$ is univariate and possesses a spectral
measure $F_{a^TX}$. By Theorem 6.2, for every $h\in\mathbb{Z}$,

$$a^T\gamma_X(h)\bar a = \gamma_{a^TX}(h) = \int_{(-\pi,\pi]} e^{ih\lambda}\,dF_{a^TX}(\lambda).$$

We can express any entry of the matrix $\gamma_X(h)$ as a linear combination of the quadratic
form on the left side, evaluated for different vectors $a$. Specifically, with $e_i$ the $i$th unit
vector in $\mathbb{C}^d$,

$$\gamma_X(h)_{i,j} + \gamma_X(h)_{j,i} = e_i^T\gamma_X(h)e_i + e_j^T\gamma_X(h)e_j - (e_i-e_j)^T\gamma_X(h)(e_i-e_j),$$
$$-i\bigl(\gamma_X(h)_{i,j} - \gamma_X(h)_{j,i}\bigr) = e_i^T\gamma_X(h)e_i + e_j^T\gamma_X(h)e_j - (e_i-ie_j)^T\gamma_X(h)(e_i+ie_j).$$

These two equations can be solved for $\gamma_X(h)_{i,j}$; for a Hermitian matrix the left sides are
$2\operatorname{Re}\gamma_X(h)_{i,j}$ and $2\operatorname{Im}\gamma_X(h)_{i,j}$. We can next express the right hand sides in the
spectral measures $F_{a^TX}$, by the first display. ■

Consider in particular a bivariate time series, written as $(X_t, Y_t)$ for univariate time
series $X_t$ and $Y_t$. The spectral density of $(X_t, Y_t)$, if it exists, is a $(2\times 2)$-matrix valued
function. The diagonal elements are the spectral densities $f_X$ and $f_Y$ of the univariate
series $X_t$ and $Y_t$. The off-diagonal elements are complex conjugates and thus define one
function, say $f_{XY}$ for the $(1,2)$-element of the matrix. The following derived functions
are often plotted:

$$\begin{array}{ll}
\operatorname{Re} f_{XY}, & \text{co-spectrum},\\
\operatorname{Im} f_{XY}, & \text{quadrature},\\
\dfrac{|f_{XY}|^2}{f_Xf_Y}, & \text{coherency},\\
|f_{XY}|, & \text{amplitude},\\
\arg f_{XY}, & \text{phase}.
\end{array}$$

It requires some experience to read these plots appropriately. The coherency is perhaps
the easiest to interpret: it is the "correlation between the series $X$ and $Y$ at the frequency
$\lambda$".
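To get a feeling for these quantities, the following Python sketch evaluates them in a toy
model that is not taken from the text: $X_t$ is a unit-variance white noise and
$Y_t = aX_{t-d} + W_t$ for an independent white noise $W_t$ with variance $s^2$. The names
`spectral_matrix`, `a`, `d` and `s2` are illustrative assumptions. With the convention
$\gamma(h) = \mathrm{E}X_{t+h}X_t^T$ for the (real) vector series, only $\gamma(0)$ and $\gamma(\pm d)$ are nonzero.

```python
import numpy as np

def spectral_matrix(gammas, lam):
    """f(lam) = (1/2pi) * sum_h gamma(h) * exp(-i h lam), entry-wise."""
    total = sum(g * np.exp(-1j * h * lam) for h, g in gammas.items())
    return total / (2 * np.pi)

# Assumed toy model: X_t white noise (variance 1), Y_t = a*X_{t-d} + W_t,
# W_t white noise with variance s2, independent of X.
a, d, s2 = 2.0, 3, 1.0
gammas = {
    0: np.array([[1.0, 0.0], [0.0, a**2 + s2]]),
    -d: np.array([[0.0, a], [0.0, 0.0]]),   # (1,2) entry: E X_{t-d} Y_t = a
    d: np.array([[0.0, 0.0], [a, 0.0]]),    # (2,1) entry: E Y_{t+d} X_t = a
}

lam = 0.4
f = spectral_matrix(gammas, lam)
f_x, f_y, f_xy = f[0, 0].real, f[1, 1].real, f[0, 1]

co_spectrum = f_xy.real
quadrature = f_xy.imag
coherency = abs(f_xy) ** 2 / (f_x * f_y)
amplitude = abs(f_xy)
phase = np.angle(f_xy)
```

In this model the coherency equals $a^2/(a^2+s^2)$ at every frequency, and the phase is
$d\lambda$ (modulo $2\pi$), so the slope of a phase plot recovers the delay $d$.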



7 ARIMA Processes



For many years ARIMA processes were the workhorses of time series analysis, "time
series analysis" being almost identical to fitting an appropriate ARIMA process. This
important class of time series models is defined through linear relations between the
observations and noise factors.



7.1 Backshift Calculus 

To simplify notation we define the backshift operator $B$ through

$$BX_t = X_{t-1}, \qquad B^kX_t = X_{t-k}.$$

This is viewed as operating on a complete time series $X_t$, transforming this into a new
series by a time shift. Even though we use the word "operator" we shall use $B$ only as
a notational device. In particular, $BY_t = Y_{t-1}$ for any other time series $Y_t$.*
For a given polynomial $\psi(z) = \sum_j\psi_jz^j$ we also abbreviate

$$\psi(B)X_t = \sum_j\psi_jX_{t-j}.$$

If the series on the right is well-defined, then we even use this notation for infinite Laurent
series $\sum_{j=-\infty}^\infty\psi_jz^j$. Then $\psi(B)X_t$ is simply a short-hand notation for the (infinite) linear
filters that we encountered before. By Lemma 1.27 the time series $\psi(B)X_t$ is certainly
well-defined if $\sum_j|\psi_j|<\infty$ and $\sup_t\mathrm{E}|X_t|<\infty$, in which case the series converges both
almost surely and in mean.



* Be aware of the dangers of this notation. For instance, if $Y_t = X_{-t}$, then $BY_t = Y_{t-1} = X_{-(t-1)}$.
This is the intended meaning. We could also argue that $BY_t = BX_{-t} = X_{-t-1}$. This is something else. Such
inconsistencies can be avoided by defining $B$ as a true operator, for instance a linear operator acting on the
linear span of a given time series, possibly depending on the time series.

If $\sum_j|\psi_j|<\infty$, then the Laurent series $\sum_j\psi_jz^j$ converges absolutely on the unit
circle $\{z\in\mathbb{C}: |z|=1\}$ in the complex plane and hence defines a function $\psi(z)$. Given
two such series (or functions) $\psi_1(z) = \sum_j\psi_{1,j}z^j$ and $\psi_2(z) = \sum_j\psi_{2,j}z^j$, the product
$\psi(z) = \psi_1(z)\psi_2(z)$ is a well-defined function on (at least) the unit circle. By changing
the summation indices this can be written as

$$\psi(z) = \psi_1(z)\psi_2(z) = \sum_k\psi_kz^k, \qquad \psi_k = \sum_j\psi_{1,j}\psi_{2,k-j}.$$

The coefficients $\psi_k$ are called the convolutions of the coefficients $\psi_{1,j}$ and $\psi_{2,j}$. The
Laurent series $\sum_k\psi_kz^k$ converges absolutely at least on the unit circle. In fact
$\sum_k|\psi_k|<\infty$.
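For finite filters the convolution formula can be checked numerically. The sketch below is
only an illustration: the coefficient values and the helper `evaluate` are made up, and a
finite stretch of data is treated as zero outside the observed range, which is an
assumption of `np.convolve` and not of the text.

```python
import numpy as np

# Two finite filters (polynomials in z), coefficients in increasing powers of z.
psi1 = np.array([1.0, -0.5])        # psi1(z) = 1 - 0.5 z
psi2 = np.array([1.0, 0.4, 0.3])    # psi2(z) = 1 + 0.4 z + 0.3 z^2

# Convolution of the coefficient sequences: psi_k = sum_j psi1_j * psi2_{k-j}.
psi = np.convolve(psi1, psi2)

def evaluate(coef, z):
    """Evaluate sum_j coef[j] * z**j."""
    return sum(c * z**j for j, c in enumerate(coef))

# psi(z) = psi1(z) * psi2(z) at a point of the unit circle.
z = np.exp(0.7j)
product_ok = np.isclose(evaluate(psi, z), evaluate(psi1, z) * evaluate(psi2, z))

# Filtering in two steps agrees with filtering once with the convolved filter.
rng = np.random.default_rng(0)
x = rng.standard_normal(50)
two_steps = np.convolve(psi1, np.convolve(psi2, x))
one_step = np.convolve(psi, x)
```

Here `two_steps` and `one_step` agree up to rounding, which is the finite-sample analogue of
replacing $z$ by $B$ in the product formula.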

7.1 EXERCISE. Show that $\sum_k|\psi_k| \le \sum_j|\psi_{1,j}|\sum_j|\psi_{2,j}|$.

Having defined the function $\psi(z)$ and verified that it has an absolutely convergent
Laurent series representation on the unit circle, we can now also define the time series
$\psi(B)X_t$. The following lemma shows that the convolution formula remains valid if $z$ is
replaced by $B$, at least when applied to stationary time series.

7.2 Lemma. If both $\sum_j|\psi_{1,j}|<\infty$ and $\sum_j|\psi_{2,j}|<\infty$, then, for every time series $X_t$
with $\sup_t\mathrm{E}|X_t|<\infty$,

$$\psi(B)X_t = \psi_1(B)\bigl[\psi_2(B)X_t\bigr], \qquad \text{a.s.}$$



Proof. The right side is to be read as $\psi_1(B)Y_t$ for $Y_t = \psi_2(B)X_t$. The variable $Y_t$ is
well-defined almost surely by Lemma 1.27, because $\sum_j|\psi_{2,j}|<\infty$ and $\sup_t\mathrm{E}|X_t|<\infty$.
Furthermore,

$$\sup_t\mathrm{E}|Y_t| = \sup_t\mathrm{E}\Bigl|\sum_j\psi_{2,j}X_{t-j}\Bigr| \le \sum_j|\psi_{2,j}|\,\sup_t\mathrm{E}|X_t| < \infty.$$

Thus the time series $\psi_1(B)Y_t$ is also well-defined by Lemma 1.27. Now

$$\mathrm{E}\sum_i\sum_j|\psi_{1,i}||\psi_{2,j}||X_{t-i-j}| \le \sup_t\mathrm{E}|X_t|\,\sum_i|\psi_{1,i}|\sum_j|\psi_{2,j}| < \infty.$$

This implies that the double series $\sum_i\sum_j\psi_{1,i}\psi_{2,j}X_{t-i-j}$ converges absolutely, almost
surely, and hence unconditionally. The latter means that we may sum the terms in an
arbitrary order. In particular, by the change of variables $(i,j)\mapsto(i=l,\,i+j=k)$,

$$\sum_i\psi_{1,i}\Bigl(\sum_j\psi_{2,j}X_{t-i-j}\Bigr) = \sum_k\Bigl(\sum_l\psi_{1,l}\psi_{2,k-l}\Bigr)X_{t-k}, \qquad \text{a.s.}$$

This is the assertion of the lemma, with $\psi_1(B)[\psi_2(B)X_t]$ on the left side. ■

The lemma implies that the "operators" $\psi_1(B)$ and $\psi_2(B)$ commute, and in a sense
asserts that the "product" $\psi_1(B)\psi_2(B)X_t$ is associative. Thus from now on we may omit
the square brackets in $\psi_1(B)[\psi_2(B)X_t]$.

7.3 EXERCISE. Verify that the lemma remains valid for any sequences $\psi_1$ and $\psi_2$ with
$\sum_j|\psi_{i,j}|<\infty$ ($i=1,2$) and every process $X_t$ such that $\sum_i\sum_j|\psi_{1,i}||\psi_{2,j}||X_{t-i-j}|<\infty$ almost
surely. In particular, conclude that $\psi_1(B)\psi_2(B)X_t = (\psi_1\psi_2)(B)X_t$ for any polynomials
$\psi_1$ and $\psi_2$ and every time series $X_t$.



7.2 ARMA Processes 

Before discussing general ARIMA processes, we consider ARMA processes. 

7.4 Definition. A time series $X_t$ is an ARMA$(p,q)$-process if there exist polynomials $\phi$
and $\theta$ of degrees $p$ and $q$, respectively, and a white noise series $Z_t$ such that
$\phi(B)X_t = \theta(B)Z_t$.

The equation $\phi(B)X_t = \theta(B)Z_t$ is to be understood as "pointwise almost surely"
on the underlying probability space: the random variables $X_t$ and $Z_t$ are defined on a
probability space $(\Omega,\mathcal{U},P)$ and satisfy $\phi(B)X_t(\omega) = \theta(B)Z_t(\omega)$ for almost every $\omega\in\Omega$.

The polynomials are often† written in the forms $\phi(z) = 1-\phi_1z-\phi_2z^2-\cdots-\phi_pz^p$
and $\theta(z) = 1+\theta_1z+\cdots+\theta_qz^q$. Then the equation $\phi(B)X_t = \theta(B)Z_t$ takes the form

$$X_t = \phi_1X_{t-1}+\phi_2X_{t-2}+\cdots+\phi_pX_{t-p}+Z_t+\theta_1Z_{t-1}+\cdots+\theta_qZ_{t-q}.$$



In other words: the value of the time series $X_t$ at time $t$ is the sum of a linear regression
on its own past and of a moving average. An ARMA$(p,0)$-process is also called an
auto-regressive process and denoted AR$(p)$; an ARMA$(0,q)$-process is also called a moving
average process and denoted MA$(q)$. Thus an auto-regressive process is a solution $X_t$ to
the equation $\phi(B)X_t = Z_t$, and a moving average process is explicitly given by
$X_t = \theta(B)Z_t$.
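The recursion $X_t = \phi_1X_{t-1}+\cdots+\phi_pX_{t-p}+Z_t+\theta_1Z_{t-1}+\cdots+\theta_qZ_{t-q}$ is easy to run on a
computer once starting values are fixed. The sketch below is one possible illustration; the
function name `arma_recursion`, the zero initialization and the Gaussian noise are
assumptions made here, not part of the definition.

```python
import numpy as np

def arma_recursion(phi, theta, z):
    """Run X_t = sum_i phi_i X_{t-i} + Z_t + sum_j theta_j Z_{t-j}.

    Assumption of this sketch: X and Z are set to zero before time 0.
    The output satisfies the ARMA equation from time max(p, q) onwards,
    but it is in general not a stationary series.
    """
    p, q = len(phi), len(theta)
    x = np.zeros(len(z))
    for t in range(len(z)):
        ar = sum(phi[i] * x[t - 1 - i] for i in range(p) if t - 1 - i >= 0)
        ma = sum(theta[j] * z[t - 1 - j] for j in range(q) if t - 1 - j >= 0)
        x[t] = ar + z[t] + ma
    return x

rng = np.random.default_rng(1)
z = rng.standard_normal(200)                       # Gaussian white noise
x = arma_recursion(phi=[0.5], theta=[0.4], z=z)    # a path of an ARMA(1, 1) recursion
```

Different starting values give different solutions of the same equation, so the recursion
by itself does not single out a stationary process.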

7.5 EXERCISE. Why is it not a loss of generality to assume $\phi_0 = \theta_0 = 1$?

We next investigate for which pairs of polynomials $\phi$ and $\theta$ there exists a
corresponding stationary ARMA-process. For given polynomials $\phi$ and $\theta$ there are always
many time series $X_t$ and $Z_t$ satisfying the ARMA equation, but there need not be a
stationary series $X_t$. If there exists a stationary solution, then we are also interested in
knowing whether this is uniquely determined by the pair $(\phi,\theta)$ and/or the white noise
series $Z_t$, and in what way it depends on the series $Z_t$.



† A notable exception is the Splus package. Its makers appear to have overdone the cleverness of including
minus-signs in the coefficients of $\phi$ and have included them in the coefficients of $\theta$ also.

7.6 Example. The polynomial $\phi(z) = 1-\phi z$ leads to the auto-regressive equation
$X_t = \phi X_{t-1}+Z_t$. In Example 1.8 we have seen that a stationary solution exists if and
only if $|\phi|\ne 1$. □

7.7 EXERCISE. Let arbitrary polynomials $\phi$ and $\theta$, a white noise sequence $Z_t$ and
variables $X_1,\ldots,X_p$ be given. Show that there exists a time series $X_t$ that satisfies the
equation $\phi(B)X_t = \theta(B)Z_t$ and coincides with the given $X_1,\ldots,X_p$ at times $1,\ldots,p$.
What does this imply about existence of solutions if only the $Z_t$ and the polynomials $\phi$
and $\theta$ are given?

In the following theorem we shall see that a stationary solution to the ARMA-equation
exists if the polynomial $z\mapsto\phi(z)$ has no roots on the unit circle $\{z\in\mathbb{C}: |z|=1\}$.
To prove this, we need some facts from complex analysis. The function

$$\psi(z) = \frac{\theta(z)}{\phi(z)}$$

is well-defined and analytic on the region $\{z\in\mathbb{C}: \phi(z)\ne 0\}$. If $\phi$ has no roots on the
unit circle $\{z: |z|=1\}$, then since it has at most $p$ different roots, there is an annulus
$\{z: r<|z|<R\}$ with $r<1<R$ on which it has no roots. On this annulus $\psi$ is an
analytic function, and it has a Laurent series representation

$$\psi(z) = \sum_{j=-\infty}^\infty\psi_jz^j.$$

This series is uniformly and absolutely convergent on every compact subset of the annulus,
and the coefficients $\psi_j$ are uniquely determined by the values of $\psi$ on the annulus.
In particular, because the unit circle is inside the annulus, we obtain that $\sum_j|\psi_j|<\infty$.
Then we know that $\psi(B)Z_t$ is a well-defined, stationary time series. By the following
theorem it is the unique stationary solution to the ARMA-equation. (Here by "solution"
we mean a time series that solves the equation up to null sets, and the uniqueness is also
up to null sets.)

7.8 Theorem. Let $\phi$ and $\theta$ be polynomials such that $\phi$ has no roots on the complex
unit circle, and let $Z_t$ be a white noise process. Define $\psi = \theta/\phi$. Then $X_t = \psi(B)Z_t$
is the unique stationary solution to the equation $\phi(B)X_t = \theta(B)Z_t$. It is also the only
solution that is bounded in $L_1$.

Proof. By the rules of calculus justified by Lemma 7.2, $\phi(B)\psi(B)Z_t = \theta(B)Z_t$, because
$\phi(z)\psi(z) = \theta(z)$ on an annulus around the unit circle. This proves that $\psi(B)Z_t$ is a
solution to the ARMA-equation. It is stationary by Lemma 1.27.

Let $X_t$ be an arbitrary solution to the ARMA equation that is bounded in $L_1$, for
instance a stationary solution. The function $\tilde\phi(z) = 1/\phi(z)$ is analytic on an annulus
around the unit circle and hence possesses a unique Laurent series representation
$\tilde\phi(z) = \sum_j\tilde\phi_jz^j$. Thus $\tilde\phi(B)Y_t$ is well-defined for every stationary time series $Y_t$ by Lemma 1.27.
By the calculus of Lemma 7.2, $\tilde\phi(B)\phi(B)X_t = X_t$, because $\tilde\phi(z)\phi(z) = 1$. Therefore, the
equation $\phi(B)X_t = \theta(B)Z_t$ implies, after multiplying by $\tilde\phi(B)$, that
$X_t = \tilde\phi(B)\theta(B)Z_t = \psi(B)Z_t$, again by the calculus of Lemma 7.2, because $\tilde\phi(z)\theta(z) = \psi(z)$.
This proves that $\psi(B)Z_t$ is the unique stationary solution to the ARMA-equation. ■

7.9 EXERCISE. It is certainly not true that $\psi(B)Z_t$ is the only solution to the
ARMA-equation. Can you trace where exactly in the preceding proof we use the required
stationarity of the solution? Would you agree that the "calculus" of Lemma 7.2 is perhaps
more subtle than it appeared to be at first?

Thus the condition that $\phi$ has no roots on the unit circle is sufficient for the existence
of a stationary solution. It is almost necessary. The only point is that it is really the
quotient $\theta/\phi$ that counts, not the function $\phi$ on its own. If $\phi$ has a zero on the unit circle
of the same or smaller multiplicity as $\theta$, then this quotient is still a nice function. Once
this possibility is excluded, there can be no stationary solution if $\phi(z) = 0$ for some $z$
with $|z| = 1$.

7.10 Theorem. Let $\phi$ and $\theta$ be polynomials such that $\phi$ has a root on the unit circle
that is not a root of $\theta$, and let $Z_t$ be a white noise process. Then there exists no stationary
solution $X_t$ to the equation $\phi(B)X_t = \theta(B)Z_t$.

Proof. Suppose that the contrary is true and let $X_t$ be a stationary solution. Then $X_t$
has a spectral distribution $F_X$, and so does $\phi(B)X_t = \theta(B)Z_t$. By Theorem 6.9 and
Example 6.6 this must satisfy

$$|\phi(e^{-i\lambda})|^2\,dF_X(\lambda) = |\theta(e^{-i\lambda})|^2\,\frac{\sigma^2}{2\pi}\,d\lambda.$$

Now suppose that $\phi(e^{-i\lambda_0}) = 0$ and $\theta(e^{-i\lambda_0})\ne 0$ for some $\lambda_0$. The preceding display is
just an equation between densities of measures and should not be interpreted as being
valid for every $\lambda$, so we cannot immediately conclude that there is a contradiction. By
differentiability of $\phi$ and continuity of $\theta$ there exist positive numbers $A$ and $B$ and a
neighbourhood of $\lambda_0$ on which both $|\phi(e^{-i\lambda})|\le A|\lambda-\lambda_0|$ and $|\theta(e^{-i\lambda})|\ge B$. Combining
this with the preceding display, we see that, for all sufficiently small $\varepsilon>0$,

$$\int_{\lambda_0-\varepsilon}^{\lambda_0+\varepsilon}A^2|\lambda-\lambda_0|^2\,dF_X(\lambda) \ge \int_{\lambda_0-\varepsilon}^{\lambda_0+\varepsilon}B^2\,\frac{\sigma^2}{2\pi}\,d\lambda.$$

The left side is bounded above by $A^2\varepsilon^2F_X(\lambda_0-\varepsilon,\lambda_0+\varepsilon)$, whereas the right side is equal
to $B^2\sigma^2\varepsilon/\pi$. This shows that $F_X(\lambda_0-\varepsilon,\lambda_0+\varepsilon)\to\infty$ as $\varepsilon\to 0$ and contradicts the fact
that $F_X$ is a finite measure. ■

7.11 Example. The AR(1)-equation $X_t = \phi X_{t-1}+Z_t$ corresponds to the polynomial
$\phi(z) = 1-\phi z$. This has root $\phi^{-1}$. Therefore a stationary solution exists if and only if
$|\phi^{-1}|\ne 1$. In the latter case, the Laurent series expansion of $\psi(z) = 1/(1-\phi z)$ around
the unit circle is given by $\psi(z) = \sum_{j=0}^\infty\phi^jz^j$ for $|\phi|<1$ and is given by $-\sum_{j=1}^\infty\phi^{-j}z^{-j}$
for $|\phi|>1$. Consequently, the unique stationary solutions in these cases are given by

$$X_t = \begin{cases}\sum_{j=0}^\infty\phi^jZ_{t-j}, & \text{if } |\phi|<1,\\[2pt] -\sum_{j=1}^\infty\phi^{-j}Z_{t+j}, & \text{if } |\phi|>1.\end{cases}$$

This is in agreement, of course, with Example 1.8. □
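The root condition can be checked numerically for a given auto-regressive polynomial. The
following sketch is an illustration only: the function name `classify_ar_polynomial` and
the tolerance `tol` are assumptions, and the labels presuppose that $\phi$ and $\theta$ have no
common roots.

```python
import numpy as np

def classify_ar_polynomial(phi, tol=1e-8):
    """Classify phi(z) = 1 - phi_1 z - ... - phi_p z^p by the moduli of its roots."""
    # np.roots expects coefficients from the highest power down to the constant.
    coef = np.r_[-np.asarray(phi, dtype=float)[::-1], 1.0]
    moduli = np.abs(np.roots(coef))
    if np.any(np.abs(moduli - 1.0) < tol):
        return "root on the unit circle"
    if np.all(moduli > 1.0):
        return "stationary and causal"
    return "stationary but not causal"

print(classify_ar_polynomial([0.5]))   # AR(1) with phi = 0.5, root 2
print(classify_ar_polynomial([2.0]))   # AR(1) with phi = 2, root 1/2
print(classify_ar_polynomial([1.0]))   # AR(1) with phi = 1, root 1
```

Because the roots are computed in floating point, a root that lies exactly on the unit
circle is only detected up to the chosen tolerance.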

7.12 EXERCISE. Investigate the existence of stationary solutions to: 
(i) X-t = yXt-i + TyXt-i + Z t ; 
(ii) X-t = ijXt-i + jX-t-2 + Z t + 2%t-i + jZt-2- 

Warning. Some authors require by definition that an ARMA process be stationary.
Many authors occasionally forget to say explicitly that they are concerned with a
stationary ARMA process. Some authors mistakenly believe that stationarity requires that
$\phi$ has no roots inside the unit circle and may fail to recognize that the ARMA equation
does not define a process without some sort of initialization.

If given time series $X_t$ and $Z_t$ satisfy the ARMA-equation $\phi(B)X_t = \theta(B)Z_t$, then
they also satisfy $r(B)\phi(B)X_t = r(B)\theta(B)Z_t$, for any polynomial $r$. From observed data
$X_t$ it is impossible to determine whether $(\phi,\theta)$ or $(r\phi,r\theta)$ are the "right" polynomials.
To avoid this problem of indeterminacy, we assume from now on that the ARMA-model
is always written in its simplest form. This is when $\phi$ and $\theta$ do not have common factors
(are relatively prime in the algebraic sense), or equivalently, when $\phi$ and $\theta$ do not have
common (complex) roots. Then, in view of the preceding theorems, a stationary solution
$X_t$ to the ARMA-equation exists if and only if $\phi$ has no roots on the unit circle, and this
is uniquely given by

$$X_t = \psi(B)Z_t = \sum_j\psi_jZ_{t-j}, \qquad \psi = \frac{\theta}{\phi}.$$



7.13 Definition. An ARMA-process $X_t$ is called causal if, in the preceding representation,
the filter is causal: i.e. $\psi_j = 0$ for every $j<0$.

Thus a causal ARMA-process $X_t$ depends on present and past values $Z_t, Z_{t-1},\ldots$
of the noise sequence only. Intuitively, this is a desirable situation, if time is really time
and $Z_t$ is really attached to time $t$. We come back to this in Section 7.6.

A mathematically equivalent definition of causality is that the function $\psi(z)$ is
analytic in a neighbourhood of the unit disc $\{z\in\mathbb{C}: |z|\le 1\}$. This follows, because the
Laurent series $\sum_{j=-\infty}^\infty\psi_jz^j$ is analytic inside the unit disc if and only if the negative
powers of $z$ do not occur. Still another description of causality is that all roots of $\phi$ are
outside the unit circle, because only then is the function $\psi = \theta/\phi$ analytic on the unit
disc.

The proof of Theorem 7.8 does not need that $Z_t$ is a white noise process, but only
that the series $Z_t$ is bounded in $L_1$. Therefore, the same arguments can be used to invert
the ARMA-equation in the other direction. If $\theta$ has no roots on the unit circle and $X_t$ is
stationary, then $\phi(B)X_t = \theta(B)Z_t$ implies that

$$Z_t = \pi(B)X_t = \sum_j\pi_jX_{t-j}, \qquad \pi = \frac{\phi}{\theta}.$$

7.14 Definition. An ARMA-process $X_t$ is called invertible if, in the preceding
representation, the filter is causal: i.e. $\pi_j = 0$ for every $j<0$.

Equivalent mathematical definitions are that $\pi(z)$ is an analytic function on the unit
disc or that $\theta$ has all its roots outside the unit circle. In the definition of invertibility we
implicitly assume that $\theta$ has no roots on the unit circle. The general situation is more
technical and is discussed in the next section.



* 7.3 Invertibility 

In this section we discuss the proper definition of invertibility in the case that $\theta$ has
roots on the unit circle. The intended meaning of "invertibility" is that every $Z_t$ can be
written as a linear function of the $X_s$ that are prior or simultaneous to $t$. Two reasonable
ways to make this precise are:
(i) $Z_t = \sum_{j=0}^\infty\pi_jX_{t-j}$ for a sequence $\pi_j$ such that $\sum_{j=0}^\infty|\pi_j|<\infty$.
(ii) $Z_t$ is contained in the closed linear span of $X_t, X_{t-1}, X_{t-2},\ldots$ in $L_2(\Omega,\mathcal{U},P)$.
In both cases we require that $Z_t$ depends linearly on the $X_s$ with $s\le t$, but the second
requirement is weaker. It turns out that if $X_t$ is an ARMA process relative to $Z_t$ and
(i) holds, then the polynomial $\theta$ cannot have roots on the unit circle. In that case the
definition of invertibility given in the preceding section is appropriate (and equivalent to
(i)). However, the requirement (ii) does not exclude the possibility that $\theta$ has zeros on
the unit circle. An ARMA process is invertible in the sense of (ii) as soon as $\theta$ does not
have roots inside the unit circle.

7.15 Lemma. Let $X_t$ be a stationary ARMA process satisfying $\phi(B)X_t = \theta(B)Z_t$ for
polynomials $\phi$ and $\theta$ that are relatively prime.
(i) Then $Z_t = \sum_{j=0}^\infty\pi_jX_{t-j}$ for a sequence $\pi_j$ such that $\sum_{j=0}^\infty|\pi_j|<\infty$ if and only if
$\theta$ has no roots on or inside the unit circle.
(ii) If $\theta$ has no roots inside the unit circle, then $Z_t$ is contained in the closed linear span
of $X_t, X_{t-1}, X_{t-2},\ldots$.

Proof. (i). If $\theta$ has no roots on or inside the unit circle, then the ARMA process is
invertible by the arguments given previously. We must argue the other direction. If $Z_t$
has the given representation, then consideration of the spectral measures gives

$$\frac{\sigma^2}{2\pi}\,d\lambda = dF_Z(\lambda) = |\pi(e^{-i\lambda})|^2\,dF_X(\lambda) = |\pi(e^{-i\lambda})|^2\,\frac{|\theta(e^{-i\lambda})|^2}{|\phi(e^{-i\lambda})|^2}\,\frac{\sigma^2}{2\pi}\,d\lambda.$$

Hence $|\pi(e^{-i\lambda})\theta(e^{-i\lambda})| = |\phi(e^{-i\lambda})|$ Lebesgue almost everywhere. If $\sum_j|\pi_j|<\infty$, then
the function $\lambda\mapsto\pi(e^{-i\lambda})$ is continuous, as are the functions $\phi$ and $\theta$, and hence this
equality must hold for every $\lambda$. Since $\phi(z)$ has no roots on the unit circle, nor can $\theta(z)$.
(ii). Suppose that $\zeta^{-1}$ is a zero of $\theta$, so that $|\zeta|\le 1$ and $\theta(z) = (1-\zeta z)\theta_1(z)$
for a polynomial $\theta_1$ of degree $q-1$. Define $Y_t = \phi(B)X_t$ and $V_t = \theta_1(B)Z_t$, whence
$Y_t = V_t-\zeta V_{t-1}$. It follows that

$$\sum_{j=0}^{k-1}\zeta^jY_{t-j} = \sum_{j=0}^{k-1}\zeta^j(V_{t-j}-\zeta V_{t-j-1}) = V_t-\zeta^kV_{t-k}.$$

If $|\zeta|<1$, then the right side converges to $V_t$ in quadratic mean as $k\to\infty$ and hence
it follows that $V_t$ is contained in the closed linear span of $Y_t, Y_{t-1},\ldots$, which is clearly
contained in the closed linear span of $X_t, X_{t-1},\ldots$, because $Y_t = \phi(B)X_t$. If $q = 1$, then
$V_t$ and $Z_t$ are equal up to a constant and the proof is complete. If $q>1$, then we repeat
the argument with $\theta_1$ instead of $\theta$ and $V_t$ in the place of $Y_t$ and we shall be finished after
finitely many recursions.

If $|\zeta| = 1$, then the right side of the preceding display still converges to $V_t$ as $k\to\infty$,
but only in the weak sense that $\mathrm{E}(V_t-\zeta^kV_{t-k})W\to\mathrm{E}V_tW$ for every square integrable
variable $W$. This implies that $V_t$ is in the weak closure of $\operatorname{lin}(Y_t, Y_{t-1},\ldots)$, but this is
equal to the strong closure by an application of the Hahn-Banach theorem. Thus we
arrive at the same conclusion.

To see the weak convergence, note first that the projection of $W$ onto the closed
linear span of $\{Z_t: t\in\mathbb{Z}\}$ is given by $\sum_j\psi_jZ_j$ for some sequence $\psi_j$ with $\sum_j|\psi_j|^2<\infty$.
Because $V_{t-k}\in\operatorname{lin}(Z_s: s\le t-k)$, we have
$|\mathrm{E}V_{t-k}W| = |\sum_j\psi_j\mathrm{E}V_{t-k}Z_j| \le \sum_{j\le t-k}|\psi_j|\operatorname{sd}V_0\operatorname{sd}Z_0\to 0$ as $k\to\infty$. ■

7.16 Example. The moving average $X_t = Z_t-Z_{t-1}$ is invertible in the sense of (ii),
but not in the sense of (i). The moving average $X_t = Z_t-1.01Z_{t-1}$ is not invertible.

Thus $X_t = Z_t-Z_{t-1}$ implies that $Z_t\in\overline{\operatorname{lin}}(X_t, X_{t-1},\ldots)$. An unexpected
phenomenon is that it is also true that $Z_t$ is contained in $\overline{\operatorname{lin}}(X_{t+1}, X_{t+2},\ldots)$. This follows
by time reversal: define $U_t = X_{-t+1}$ and $W_t = -Z_{-t}$ and apply the preceding to the
processes $U_t = W_t-W_{t-1}$. Thus it appears that the "opposite" of invertibility is true as
well! □

7.17 EXERCISE. Suppose that $X_t = \theta(B)Z_t$ for a polynomial $\theta$ of degree $q$ that has
all its roots on the unit circle. Show that $Z_t\in\overline{\operatorname{lin}}(X_{t+q}, X_{t+q+1},\ldots)$. [As in (ii) of the
preceding proof, it follows that $V_t = \zeta^{-k}\bigl(V_{t+k}-\sum_{j=0}^{k-1}\zeta^jY_{t+k-j}\bigr)$. Here the first term on
the right side converges weakly to zero as $k\to\infty$.]




7.4 Prediction 

As is to be expected from their definitions, causality and invertibility are important for
calculating predictions for ARMA processes. For a causal and invertible stationary ARMA
process $X_t$ satisfying $\phi(B)X_t = \theta(B)Z_t$ we have

$$X_t\in\overline{\operatorname{lin}}(Z_t, Z_{t-1},\ldots) \quad\text{(causality)},$$
$$Z_t\in\overline{\operatorname{lin}}(X_t, X_{t-1},\ldots) \quad\text{(invertibility)}.$$

Here $\overline{\operatorname{lin}}$, the closed linear span, is the operation of first forming all (finite) linear
combinations and next taking the metric closure in $L_2(\Omega,\mathcal{U},P)$ of this linear span. Since $Z_t$ is
a white noise process, the variable $Z_{t+1}$ is orthogonal to the linear span of $Z_t, Z_{t-1},\ldots$.
By the continuity of the inner product it is then also orthogonal to the closed linear
span of $Z_t, Z_{t-1},\ldots$ and hence, under causality, it is orthogonal to $X_s$ for every $s\le t$.
This shows that the variable $Z_{t+1}$ is totally (linearly) unpredictable at time $t$ given the
observations $X_1,\ldots,X_t$. This is often interpreted in the sense that the variable $Z_t$ is an
"external noise variable" that is generated at time $t$ independently of the history of the
system before time $t$.

7.18 EXERCISE. The preceding argument gives that $Z_{t+1}$ is uncorrelated with the
system variables $X_t, X_{t-1},\ldots$ of the past. Show that if the variables $Z_t$ are independent,
then $Z_{t+1}$ is independent of the system up to time $t$, not just uncorrelated.

This general discussion readily gives the structure of the best linear predictor for
causal auto-regressive stationary processes. Suppose that

$$X_{t+1} = \phi_1X_t+\cdots+\phi_pX_{t+1-p}+Z_{t+1}.$$

If $t\ge p$, then $X_t,\ldots,X_{t+1-p}$ are perfectly predictable based on the past variables
$X_1,\ldots,X_t$: by themselves. If the series is causal, then $Z_{t+1}$ is totally unpredictable,
in view of the preceding discussion. Since a best linear predictor is a projection and
projections are linear maps, the best linear predictor of $X_{t+1}$ based on $X_1,\ldots,X_t$ is
given by

$$\Pi_tX_{t+1} = \phi_1X_t+\cdots+\phi_pX_{t+1-p}, \qquad (t\ge p).$$

We should be able to obtain this result also from the prediction equations (2.1) and 
the explicit form of the auto-covariance function, but that calculation would be more 
complicated. 
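For a causal AR($p$) process written as $X_{t+1} = \phi_1X_t+\cdots+\phi_pX_{t+1-p}+Z_{t+1}$, the one-step
predictor only needs the last $p$ observations. A minimal sketch in Python, where the
function name `ar_one_step_predictor` and the example numbers are hypothetical:

```python
import numpy as np

def ar_one_step_predictor(phi, x):
    """Best linear predictor of X_{t+1} from X_1, ..., X_t for a causal AR(p).

    `x` holds the observations X_1, ..., X_t in time order; requires t >= p.
    """
    phi = np.asarray(phi, dtype=float)
    p = len(phi)
    if len(x) < p:
        raise ValueError("need at least p observations")
    recent = np.asarray(x, dtype=float)[::-1][:p]   # X_t, X_{t-1}, ..., X_{t+1-p}
    return float(phi @ recent)

# AR(2) with phi_1 = 0.5, phi_2 = 0.3 and observations X_1, X_2, X_3.
prediction = ar_one_step_predictor([0.5, 0.3], [1.0, 2.0, 4.0])   # 0.5*4 + 0.3*2
```

The coefficients are taken as known here; estimating them from data is a separate problem.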

7.19 EXERCISE. Find a formula for the best linear predictor of $X_{t+2}$ based on
$X_1,\ldots,X_t$, if $t-p\ge 1$.

For moving average and general ARMA processes the situation is more complicated.
Here a similar argument works only for computing the best linear predictor $\Pi_{-\infty,t}X_{t+1}$
based on the infinite past $X_t, X_{t-1},\ldots$ down to time $-\infty$. Assume that $X_t$ is a causal
and invertible stationary ARMA process satisfying

$$X_{t+1} = \phi_1X_t+\cdots+\phi_pX_{t+1-p}+Z_{t+1}+\theta_1Z_t+\cdots+\theta_qZ_{t+1-q}.$$

By causality the variable $Z_{t+1}$ is completely unpredictable. By invertibility the variable
$Z_s$ is perfectly predictable based on $X_s, X_{s-1},\ldots$ and hence is perfectly predictable based
on $X_t, X_{t-1},\ldots$ for every $s\le t$. Therefore,

$$\Pi_{-\infty,t}X_{t+1} = \phi_1X_t+\cdots+\phi_pX_{t+1-p}+\theta_1Z_t+\cdots+\theta_qZ_{t+1-q}.$$

The practical importance of this formula is small, because we never observe the
complete past. However, if we observe a long series $X_1,\ldots,X_t$, then the "distant past"
$X_0, X_{-1},\ldots$ will not give much additional information over the "recent past" $X_t,\ldots,X_1$,
and $\Pi_{-\infty,t}X_{t+1}$ and $\Pi_tX_{t+1}$ will be close.

7.20 EXERCISE. Suppose that $\phi$ and $\theta$ do not have zeros on or inside the unit circle.
Show that $\mathrm{E}|\Pi_{-\infty,t}X_{t+1}-\Pi_tX_{t+1}|^2\to 0$ as $t\to\infty$. [Express $Z_t$ as $Z_t = \sum_{j=0}^\infty\pi_jX_{t-j}$;
show that $|\pi_j|$ decreases exponentially fast. The difference $|\Pi_{-\infty,t}X_{t+1}-\Pi_tX_{t+1}|$ is
bounded above by $\sum_{j=1}^q|\theta_j||Z_{t+1-j}-\Pi_tZ_{t+1-j}|$.]

We conclude by remarking that for causal stationary auto-regressive processes
the square prediction error $\mathrm{E}|X_{t+1}-\Pi_tX_{t+1}|^2$ is equal to $\mathrm{E}Z_{t+1}^2$; for general
stationary ARMA-processes this is approximately true for large $t$; in both cases
$\mathrm{E}|X_{t+1}-\Pi_{-\infty,t}X_{t+1}|^2 = \mathrm{E}Z_{t+1}^2$.



7.5 Auto Correlation and Spectrum 

In this section we discuss several methods to express the auto-covariance function of a 
stationary ARMA-process in its parameters and obtain an expression for the spectral 
density. 

The latter is immediate from the representation $X_t = \psi(B)Z_t$ and Theorem 6.9.

7.21 Theorem. The stationary ARMA process satisfying $\phi(B)X_t = \theta(B)Z_t$ possesses
a spectral density given by

$$f_X(\lambda) = \left|\frac{\theta(e^{-i\lambda})}{\phi(e^{-i\lambda})}\right|^2\frac{\sigma^2}{2\pi}.$$
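The spectral density $f_X(\lambda) = |\theta(e^{-i\lambda})/\phi(e^{-i\lambda})|^2\sigma^2/(2\pi)$ is straightforward to evaluate
on a grid of frequencies. The sketch below assumes the sign conventions
$\phi(z) = 1-\phi_1z-\cdots-\phi_pz^p$ and $\theta(z) = 1+\theta_1z+\cdots+\theta_qz^q$; the function name
`arma_spectral_density` and the grid size are illustrative choices.

```python
import numpy as np

def arma_spectral_density(phi, theta, sigma2, lam):
    """f_X(lam) = |theta(e^{-i lam}) / phi(e^{-i lam})|^2 * sigma2 / (2 pi)."""
    z = np.exp(-1j * np.asarray(lam, dtype=float))
    phi_z = 1.0 - sum(c * z ** (k + 1) for k, c in enumerate(phi))
    theta_z = 1.0 + sum(c * z ** (k + 1) for k, c in enumerate(theta))
    return np.abs(theta_z / phi_z) ** 2 * sigma2 / (2 * np.pi)

# AR(1) with phi = 0.5 and sigma^2 = 1: the density integrates to gamma_X(0).
grid = np.linspace(-np.pi, np.pi, 4096, endpoint=False)
f = arma_spectral_density([0.5], [], 1.0, grid)
gamma0 = f.mean() * 2 * np.pi      # Riemann sum over (-pi, pi]
```

For this AR(1) example `gamma0` should be close to $\sigma^2/(1-\phi^2) = 4/3$, the variance of the
stationary solution.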



Finding a simple expression for the auto-covariance function is harder, except for the
special case of moving average processes, for which the auto-covariances can be expressed
in the parameters $\theta_1,\ldots,\theta_q$ by a direct computation (cf. Example 1.6 and Lemma 1.27).
The auto-covariances of a general stationary ARMA process can be solved from a system
of equations. In view of Lemma 1.27(iii), the equation $\phi(B)X_t = \theta(B)Z_t$ leads to the
identities, with $\phi(z) = \sum_j\tilde\phi_jz^j$ and $\theta(z) = \sum_j\theta_jz^j$,

$$\sum_l\Bigl(\sum_j\tilde\phi_j\tilde\phi_{j+l-h}\Bigr)\gamma_X(l) = \sigma^2\sum_j\theta_j\theta_{j+h}, \qquad h\in\mathbb{Z}.$$

In principle this system of equations can be solved for $\gamma_X(l)$.

Figure 7.1. Spectral density of the AR series satisfying $X_t-1.5X_{t-1}+0.9X_{t-2}-0.2X_{t-3}+0.1X_{t-9} = Z_t$.
(Vertical axis in decibels, i.e. it gives the logarithm of the spectrum.)

An alternative method to compute the auto-covariance function is to write
$X_t = \psi(B)Z_t$ for $\psi = \theta/\phi$, whence, by Lemma 1.27(iii),

$$\gamma_X(h) = \sigma^2\sum_j\psi_j\psi_{j+h}.$$

This requires the computation of the coefficients $\psi_j$, which can be expressed in $\phi$ and $\theta$
by comparing coefficients of the power series equation $\phi(z)\psi(z) = \theta(z)$.

7.22 Example. For the AR(1) series $X_t = \phi X_{t-1}+Z_t$ with $|\phi|<1$ we obtain
$\psi(z) = (1-\phi z)^{-1} = \sum_{j=0}^\infty\phi^jz^j$. Therefore,
$\gamma_X(h) = \sigma^2\sum_{j=0}^\infty\phi^j\phi^{j+h} = \sigma^2\phi^h/(1-\phi^2)$ for $h\ge 0$.
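Comparing coefficients in $\phi(z)\psi(z) = \theta(z)$, with $\phi(z) = 1-\phi_1z-\cdots-\phi_pz^p$ and
$\theta(z) = 1+\theta_1z+\cdots+\theta_qz^q$, gives the recursion
$\psi_j = \theta_j+\sum_{i=1}^{\min(j,p)}\phi_i\psi_{j-i}$ (with $\theta_0 = 1$ and $\theta_j = 0$ for $j>q$) in the causal case. The
sketch below turns this into code; the function names and the truncation length `n` are
assumptions of the illustration, and the sum over $j$ is cut off after `n` terms.

```python
import numpy as np

def psi_weights(phi, theta, n):
    """First n coefficients of psi = theta/phi for a causal ARMA process."""
    theta_full = np.r_[1.0, np.asarray(theta, dtype=float)]   # theta_0 = 1
    psi = np.zeros(n)
    for j in range(n):
        t_j = theta_full[j] if j < len(theta_full) else 0.0
        psi[j] = t_j + sum(phi[i - 1] * psi[j - i]
                           for i in range(1, min(j, len(phi)) + 1))
    return psi

def autocovariance(phi, theta, sigma2, h, n=500):
    """Truncated version of gamma_X(h) = sigma2 * sum_j psi_j psi_{j+h}, h >= 0."""
    psi = psi_weights(phi, theta, n + h)
    return sigma2 * float(np.dot(psi[:n], psi[h:h + n]))

# AR(1) with phi = 0.5: compare with sigma^2 phi^h / (1 - phi^2) at h = 2.
approx = autocovariance([0.5], [], 1.0, h=2)
exact = 0.5**2 / (1 - 0.5**2)
```

Because the $\psi_j$ decrease exponentially fast for a causal process, the truncation error is
negligible for moderate `n` unless a root of $\phi$ is very close to the unit circle.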



7.23 EXERCISE. Find $\gamma_X(h)$ for the stationary ARMA(1, 1) series $X_t = \phi X_{t-1}+Z_t+\theta Z_{t-1}$
with $|\phi|<1$.

* 7.24 EXERCISE. Show that the auto-covariance function of a stationary ARMA process
decreases exponentially. Give an estimate of the constant in the exponent in terms of the
distance of the zeros of $\phi$ to the unit circle.

A third method to express the auto-covariance function in the coefficients of the
polynomials $\phi$ and $\theta$ uses the spectral representation

$$\gamma_X(h) = \int_{-\pi}^{\pi}e^{ih\lambda}f_X(\lambda)\,d\lambda = \frac{\sigma^2}{2\pi i}\oint_{|z|=1}z^{h-1}\,\frac{\theta(z)\theta(z^{-1})}{\phi(z)\phi(z^{-1})}\,dz.$$

The second integral is a contour integral along the positively oriented unit circle in the
complex plane. We have assumed that the coefficients of the polynomials $\phi$ and $\theta$ are
real, so that $\phi(z)\phi(z^{-1}) = \phi(z)\overline{\phi(z)} = |\phi(z)|^2$ for every $z$ on the unit circle, and similarly
for $\theta$. The next step is to evaluate the contour integral with the help of the residue
theorem from complex function theory. The poles of the integrand are contained in the
set consisting of the zeros $v_i$ and their inverses $v_i^{-1}$ of $\phi$ and possibly the point 0. The
auto-covariance function can be written as a function of the residues at these points.

7.25 Example (ARMA(1, 1)). Consider the stationary ARMA(1, 1) series
$X_t = \phi X_{t-1}+Z_t+\theta Z_{t-1}$ with $0<|\phi|<1$. The corresponding function $\phi(z)\phi(z^{-1})$ has zeros of
multiplicity 1 at the points $\phi^{-1}$ and $\phi$. Both points yield a pole of first order for the
integrand in the contour integral. The number $\phi^{-1}$ is outside the unit circle, so we only need
to compute the residue at the second point. The function $\theta(z^{-1})/\phi(z^{-1}) = (z+\theta)/(z-\phi)$
is analytic in a neighbourhood of 0 and hence does not contribute other poles, but the
term $z^{h-1}$ may contribute a pole at 0. For $h\ge 1$ the integrand has poles at $\phi$ and $\phi^{-1}$
only and hence

$$\gamma_X(h) = \sigma^2\operatorname*{res}_{z=\phi}\frac{z^{h-1}(1+\theta z)(1+\theta z^{-1})}{(1-\phi z)(1-\phi z^{-1})} = \sigma^2\phi^h\,\frac{(1+\theta\phi)(1+\theta/\phi)}{1-\phi^2}.$$

For $h = 0$ the integrand has an additional pole at $z = 0$ and the integral evaluates to
the sum of the residues at the two poles at $z = 0$ and $z = \phi$. The first residue is equal to
$-\theta/\phi$. Thus

$$\gamma_X(0) = \sigma^2\Bigl[\frac{(1+\theta\phi)(1+\theta/\phi)}{1-\phi^2}-\frac{\theta}{\phi}\Bigr].$$

The values of $\gamma_X(h)$ for $h<0$ follow by symmetry. □
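The residue formulas for the ARMA(1, 1) series can be compared with a direct numerical
evaluation of $\gamma_X(h) = \int_{-\pi}^{\pi}e^{ih\lambda}f_X(\lambda)\,d\lambda$. The following Python sketch does this for
one arbitrarily chosen parameter pair; the values of `phi`, `theta`, `sigma2` and the grid
size are assumptions made for the illustration.

```python
import numpy as np

phi, theta, sigma2 = 0.5, 0.4, 1.0   # illustrative values with 0 < |phi| < 1

def f_X(lam):
    """Spectral density of X_t = phi X_{t-1} + Z_t + theta Z_{t-1}."""
    z = np.exp(-1j * lam)
    return np.abs((1 + theta * z) / (1 - phi * z)) ** 2 * sigma2 / (2 * np.pi)

def gamma_numeric(h, n=4096):
    """Integral of e^{i h lam} f_X(lam) over (-pi, pi], by an equispaced Riemann sum."""
    lam = np.linspace(-np.pi, np.pi, n, endpoint=False)
    return float(np.real(np.mean(np.exp(1j * h * lam) * f_X(lam)) * 2 * np.pi))

def gamma_residue(h):
    """Closed forms obtained from the residues, for h >= 0."""
    main = (1 + theta * phi) * (1 + theta / phi) / (1 - phi**2)
    if h == 0:
        return sigma2 * (main - theta / phi)
    return sigma2 * phi**h * main

differences = [abs(gamma_numeric(h) - gamma_residue(h)) for h in range(4)]
```

All entries of `differences` should be at the level of rounding error, since the integrand is
smooth and periodic.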

7.26 EXERCISE. Find the auto-covariance function for an MA($q$) process by using the
residue theorem. (This is not easier than the direct derivation, but perhaps instructive.)

We do not present an additional method to compute the partial auto-correlation
function of an ARMA process. However, we make the important observation that for
a causal AR($p$) process the partial auto-correlations $\alpha_X(h)$ of lags $h>p$ vanish. This
follows by combining Lemma 2.33 and the expression for the best linear predictor found
in the preceding section.



7.6 Existence of Causal and Invertible Solutions

In practice we never observe the white noise process $Z_t$ in the definition of an ARMA
process. The $Z_t$ are "hidden variables" whose existence is hypothesized to explain the
observed series $X_t$. From this point of view our earlier question of existence of a stationary
solution to the ARMA equation is perhaps not the right question, as it took the sequence
$Z_t$ as given. In this section we turn this question around and consider an ARMA$(p,q)$
process $X_t$ as given. Then we shall see that there are at least $2^{p+q}$ white noise processes
$Z_t$ such that $\phi(B)X_t = \theta(B)Z_t$ for certain polynomials $\phi$ and $\theta$ of degrees $p$ and $q$,
respectively. (These polynomials depend on the choice of $Z_t$ and hence are not necessarily
the ones that are initially given.) Thus the white noise process $Z_t$ is far from being
uniquely determined by the observed series $X_t$. On the other hand, among the multitude
of solutions, only one choice yields a representation of $X_t$ as a stationary ARMA process
that is both causal and invertible.
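One source of this multiplicity is that a root $z_i$ of $\phi$ or $\theta$ inside the unit disc can be
exchanged for $1/\bar z_i$ without changing the modulus of the polynomial on the unit circle, and
hence without changing the spectral density; the proof of Theorem 7.27 uses the same
device. The Python sketch below illustrates this for a single polynomial. The helper name
`flip_roots` is hypothetical, and the code assumes that the last coefficient passed in is
nonzero.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def flip_roots(coef):
    """Move the roots inside the unit disc to the outside.

    `coef` lists the coefficients in increasing powers of z (last one nonzero).
    Each factor (z - z_i) with |z_i| < 1 is replaced by (1 - conj(z_i) z);
    the modulus of the polynomial on the unit circle is unchanged.
    """
    coef = np.asarray(coef, dtype=complex)
    roots = np.roots(coef[::-1])           # np.roots wants the highest power first
    result = np.array([coef[-1]])          # start from the leading coefficient
    for r in roots:
        factor = [1.0, -np.conj(r)] if abs(r) < 1 else [-r, 1.0]
        result = P.polymul(result, factor)
    return result

theta = np.array([1.0, 2.0])      # theta(z) = 1 + 2z, root -1/2 inside the disc
theta_star = flip_roots(theta)    # 2 + z, root -2 outside the disc

lam = np.linspace(-np.pi, np.pi, 200)
z = np.exp(-1j * lam)
same_modulus = np.allclose(np.abs(P.polyval(z, theta_star)),
                           np.abs(P.polyval(z, theta)))
```

In this example the moving averages $Z_t+2Z_{t-1}$ and $2Z_t^*+Z_{t-1}^*$, for white noise series of
equal variance, have the same spectral density, so second moments cannot tell them apart.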

7.27 Theorem. For every stationary ARMA process $X_t$ satisfying $\phi(B)X_t = \theta(B)Z_t$ for
polynomials $\phi$ and $\theta$ such that $\theta$ has no roots on the unit circle, there exist polynomials
$\phi^*$ and $\theta^*$ of the same or smaller degrees as $\phi$ and $\theta$ that have all roots outside the unit
disc and a white noise process $Z_t^*$ such that $\phi^*(B)X_t = \theta^*(B)Z_t^*$ almost surely for every
$t\in\mathbb{Z}$.

Proof. The existence of the stationary ARMA process $X_t$ and our implicit assumption
that $\phi$ and $\theta$ are relatively prime imply that $\phi$ has no roots on the unit circle. Thus
all roots of $\phi$ and $\theta$ are either inside or outside the unit circle. We shall show that we
can move the roots inside the unit circle to roots outside the unit circle by a filtering
procedure. Suppose that

$$\phi(z) = -\phi_p(z-v_1)\cdots(z-v_p), \qquad \theta(z) = \theta_q(z-w_1)\cdots(z-w_q).$$

Consider any zero $z_i$ of $\phi$ or $\theta$. If $|z_i|<1$, then we replace the term $(z-z_i)$ in the
above products by the term $(1-\bar z_iz)$; otherwise we keep $(z-z_i)$. For $z_i = 0$, this means
that we drop the term $z-z_i$ and the degree of the polynomial decreases; otherwise, the
degree remains the same. We apply this procedure to all zeros $v_i$ and $w_i$ and denote the
resulting polynomials by $\phi^*$ and $\theta^*$. Because $0<|z_i|<1$ implies that $|\bar z_i^{-1}|>1$, the
polynomials $\phi^*$ and $\theta^*$ have all zeros outside the unit circle. We have that

$$\frac{\theta(z)}{\phi(z)} = \frac{\theta^*(z)}{\phi^*(z)}\,\kappa(z), \qquad \kappa(z) = \prod_{i:|v_i|<1}\frac{1-\bar v_iz}{z-v_i}\prod_{i:|w_i|<1}\frac{z-w_i}{1-\bar w_iz}.$$

Because $X_t = (\theta/\phi)(B)Z_t$ and we want that $X_t = (\theta^*/\phi^*)(B)Z_t^*$, we define the process
$Z_t^*$ by $Z_t^* = \kappa(B)Z_t$. This is to be understood in the sense that we expand $\kappa(z)$ in its
Laurent series $\kappa(z) = \sum_j\kappa_jz^j$ and apply the corresponding linear filter to $Z_t$.

By construction we now have that $\phi^*(B)X_t = \theta^*(B)Z_t^*$. If $|z| = 1$, then
$|1-\bar z_iz| = |z-z_i|$. In view of the definition of $\kappa$ this implies that $|\kappa(z)| = 1$ for every $z$ on the unit
circle and hence the spectral density of $Z_t^*$ satisfies

$$f_{Z^*}(\lambda) = |\kappa(e^{-i\lambda})|^2f_Z(\lambda) = 1\cdot\frac{\sigma^2}{2\pi}.$$

This shows that $Z_t^*$ is a white noise process, as desired. ■

As are many results in time series analysis, the preceding theorem is a result on
second moments only. Even if $Z_t$ is an i.i.d. sequence, the theorem does not guarantee
that $Z_t^*$ is an i.i.d. sequence as well. Only first and second moments are preserved by the
filtering procedure in the proof, in general. Nevertheless, the theorem is often interpreted
as implying that not much is lost by assuming a-priori that $\phi$ and $\theta$ have all their roots
outside the unit circle.

7.28 EXERCISE. Suppose that the time series $Z_t$ is Gaussian. Show that the series $Z_t^*$
constructed in the preceding proof is Gaussian and hence i.i.d.
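
The root-flipping step in the proof of Theorem 7.27 can be checked numerically. The sketch below is only an illustration under assumptions of our own: the example polynomial $\theta(z) = 1 + 2z$ and the helper name `flip_inside_roots` are hypothetical and do not come from the notes. It replaces every factor $(z - z_i)$ with $|z_i| < 1$ by $(1 - \bar z_i z)$ and then evaluates the quotient $\kappa = \theta/\theta^*$ on the unit circle, where its modulus should be 1.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def flip_inside_roots(coeffs):
    """Replace each factor (z - z_i) with |z_i| < 1 by (1 - conj(z_i) z).

    coeffs holds the polynomial coefficients in increasing powers of z.
    Roots at zero are not treated specially in this sketch.
    """
    coeffs = np.asarray(coeffs, dtype=complex)
    new = np.array([coeffs[-1]])                 # start from the leading coefficient
    for r in P.polyroots(coeffs):
        factor = [1.0, -np.conj(r)] if abs(r) < 1 else [-r, 1.0]
        new = P.polymul(new, factor)
    return new

theta = [1.0, 2.0]                     # theta(z) = 1 + 2z, root -1/2 inside the unit circle
theta_star = flip_inside_roots(theta)  # flipped polynomial, root moved outside

lam = np.linspace(-np.pi, np.pi, 201)
z = np.exp(-1j * lam)                  # points on the unit circle
kappa = P.polyval(z, theta) / P.polyval(z, theta_star)

print(np.real_if_close(theta_star))            # coefficients of theta*
print(np.max(np.abs(np.abs(kappa) - 1.0)))     # deviation of |kappa| from 1
```

Since $|\kappa| = 1$ on the unit circle, the two moving average polynomials give the same spectral density for $X_t$ when driven by white noise of the same variance, which is the content of the theorem in this simple case.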



* 7.7 Stability 

Let $\phi$ and $\theta$ be polynomials, with $\phi$ having no roots on the unit circle. Given initial
values $X_1, \ldots, X_p$ and a process $Z_t$, we can recursively define a solution to the ARMA
equation $\phi(B)X_t = \theta(B)Z_t$ by

$$X_t = \phi_1 X_{t-1} + \cdots + \phi_p X_{t-p} + \theta(B)Z_t, \qquad t > p,$$

$$X_{t-p} = \phi_p^{-1}\bigl(X_t - \phi_1 X_{t-1} - \cdots - \phi_{p-1} X_{t-p+1} - \theta(B)Z_t\bigr), \qquad t - p < 1.$$

In view of Theorem 7.8 the resulting process $X_t$ can only be bounded in $L_2$ if the
initial values $X_1, \ldots, X_p$ are chosen randomly according to the stationary distribution.
In particular, the process $X_t$ obtained from deterministic initial values must necessarily
be unbounded (on the full time scale $t \in \mathbb{Z}$).

In this section we show that in the causal situation, when $\phi$ has no zeros on the unit
disc, the process $X_t$ tends to stationarity as $t \to \infty$. Hence in this case the unboundedness
occurs as $t \to -\infty$. This is another reason to prefer the case that $\phi$ has no roots on the
unit disc: in this case the effect of initializing the process wears off as time goes by.

Let $Z_t$ be a given white noise process and let $(X_1, \ldots, X_p)$ and $(\tilde X_1, \ldots, \tilde X_p)$ be
two possible sets of initial values, consisting of random variables defined on the same
probability space.

7.29 Theorem. Let $\phi$ and $\theta$ be polynomials such that $\phi$ has no roots on the unit disc.
Let $X_t$ and $\tilde X_t$ be the ARMA processes with initial values $(X_1, \ldots, X_p)$ and $(\tilde X_1, \ldots, \tilde X_p)$,
respectively. Then $X_t - \tilde X_t \to 0$ almost surely as $t \to \infty$.

7.30 Corollary. Let $\phi$ and $\theta$ be polynomials such that $\phi$ has no roots on the unit disc.
If $X_t$ is an ARMA process with arbitrary initial values, then the vector $(X_t, \ldots, X_{t+k})$
converges in distribution to the distribution of the stationary solution to the ARMA
equation, as $t \to \infty$, for every fixed k.

Proofs. For the corollary we take $(\tilde X_1, \ldots, \tilde X_p)$ equal to the values of the stationary
solution. Then we can conclude that the difference between $X_t$ and the stationary solution
converges almost surely to zero and hence the difference between the distributions tends
to zero.

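For the proof of the theorem we write the ARMA relationship in the "state space
form", for $t > p$,

$$\begin{pmatrix} X_t \\ X_{t-1} \\ \vdots \\ X_{t-p+1} \end{pmatrix}
=
\begin{pmatrix}
\phi_1 & \phi_2 & \cdots & \phi_{p-1} & \phi_p \\
1 & 0 & \cdots & 0 & 0 \\
0 & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & 1 & 0
\end{pmatrix}
\begin{pmatrix} X_{t-1} \\ X_{t-2} \\ \vdots \\ X_{t-p} \end{pmatrix}
+
\begin{pmatrix} \theta(B)Z_t \\ 0 \\ \vdots \\ 0 \end{pmatrix}.$$

Denote this system by $Y_t = \Phi Y_{t-1} + B_t$. By some algebra it can be shown that

$$\det(\Phi - zI) = (-1)^p z^p \phi(z^{-1}), \qquad z \neq 0.$$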

Thus the assumption that $\phi$ has no roots on the unit disc implies that the eigenvalues
of $\Phi$ are all inside the unit circle. In other words, the spectral radius of $\Phi$, the maximum
of the moduli of the eigenvalues, is strictly less than 1. Because the sequence $\|\Phi^n\|^{1/n}$
converges to the spectral radius as $n \to \infty$, we can conclude that $\|\Phi^n\|^{1/n}$ is strictly less
than 1 for all sufficiently large n, and hence $\|\Phi^n\| \to 0$ as $n \to \infty$.

If $\tilde Y_t$ relates to $\tilde X_t$ as $Y_t$ relates to $X_t$, then $Y_t - \tilde Y_t = \Phi^{t-p}(Y_p - \tilde Y_p) \to 0$ almost
surely as $t \to \infty$. ■
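
The statement of Theorem 7.29 can be seen at work in a small simulation. The following sketch is an illustration under assumptions of our own (the AR(2) coefficients, the initial values and the function names `companion` and `run_ar` are hypothetical choices, and the noise array plays the role of $\theta(B)Z_t$ with $\theta = 1$). It runs the recursion from two different sets of initial values with the same noise and compares the gap between the two paths with the spectral radius of the companion matrix.

```python
import numpy as np

def companion(phi):
    """Companion matrix with first row (phi_1, ..., phi_p) and ones on the subdiagonal."""
    p = len(phi)
    Phi = np.zeros((p, p))
    Phi[0, :] = phi
    Phi[1:, :-1] = np.eye(p - 1)
    return Phi

def run_ar(phi, init, noise):
    """Solve X_t = phi_1 X_{t-1} + ... + phi_p X_{t-p} + noise_t from given initial values."""
    p = len(phi)
    x = list(init)                       # X_1, ..., X_p in time order
    for e in noise:
        past = x[-1:-p - 1:-1]           # X_{t-1}, ..., X_{t-p}
        x.append(float(np.dot(phi, past)) + e)
    return np.array(x)

phi = [0.5, 0.3]                         # phi(z) = 1 - 0.5 z - 0.3 z^2, roots outside the unit disc
rho = max(abs(np.linalg.eigvals(companion(phi))))   # spectral radius of the companion matrix

rng = np.random.default_rng(0)
noise = rng.standard_normal(200)         # the same noise drives both paths
x = run_ar(phi, [10.0, -5.0], noise)
x_tilde = run_ar(phi, [0.0, 0.0], noise)
gap = np.abs(x - x_tilde)

print(rho)
print(gap[[0, 1, 10, 50, -1]])           # the gap shrinks geometrically
```

Because both paths share the noise, their difference satisfies the homogeneous recursion, so the gap decays at the rate given by the spectral radius, as in the last step of the proof.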

7.31 EXERCISE. Suppose that $\phi(z)$ has no zeros on the unit circle and at least one
zero inside the unit circle. Show that there exist initial values $(X_1, \ldots, X_p)$ such that the
resulting process $X_t$ is not bounded in probability as $t \to \infty$. [Let $\tilde X_t$ be the stationary
solution and let $X_t$ be the solution given initial values $(X_1, \ldots, X_p)$. Then, with notation
as in the preceding proof, $Y_t - \tilde Y_t = \Phi^{t-p}(Y_p - \tilde Y_p)$. Choose an appropriate deterministic
vector for $Y_p - \tilde Y_p$.]



7.8 ARIMA Processes 

In Chapter 1 differencing is introduced as a method to transform a nonstationary time
series into a stationary one. This method is particularly attractive in combination with
ARMA modelling: in the notation of the present chapter the differencing filters can be
written as

$$\nabla X_t = (1 - B)X_t, \qquad \nabla^d X_t = (1 - B)^d X_t, \qquad \nabla_k X_t = (1 - B^k)X_t.$$

Thus the differencing filters $\nabla$, $\nabla^d$ and $\nabla_k$ correspond to applying $\phi(B)$ for the
polynomials $\phi(z) = 1 - z$, $\phi(z) = (1 - z)^d$ and $\phi(z) = 1 - z^k$, respectively. These polynomials have
in common that all their roots are on the complex unit circle. Thus they were "forbidden"
polynomials in our preceding discussion of ARMA processes. In fact, by Theorem 7.10,
for the three given polynomials $\phi$ the series $Y_t = \phi(B)X_t$ cannot be a stationary ARMA
process if $X_t$ is already a stationary ARMA process relative to polynomials without zeros
on the unit circle.

On the other hand, $Y_t = \phi(B)X_t$ can well be a stationary ARMA process if $X_t$ is a
non-stationary time series. Thus we can use polynomials $\phi$ with roots on the unit circle
to extend the domain of ARMA modelling to nonstationary time series.
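
The differencing filters are easy to apply to data. The sketch below is a minimal illustration with made-up inputs (the quadratic trend, the period-4 pattern and the function names are assumptions for the example): $\nabla^d$ is computed with `numpy.diff`, and $\nabla_k$ by subtracting the series shifted by k steps.

```python
import numpy as np

def difference(x, d=1):
    """Apply (1 - B)^d to the series x (the result is d entries shorter)."""
    return np.diff(np.asarray(x, dtype=float), n=d)

def seasonal_difference(x, k):
    """Apply (1 - B^k) to the series x: x_t - x_{t-k}."""
    x = np.asarray(x, dtype=float)
    return x[k:] - x[:-k]

t = np.arange(48)
trend = 0.5 * t**2                                  # quadratic trend
season = np.tile([3.0, -1.0, 0.0, -2.0], 12)        # deterministic pattern with period 4

twice_differenced = difference(trend, d=2)          # differencing twice removes a quadratic trend
deseasonalized = seasonal_difference(season + 2.0 * t, k=4)

# Roots of 1 - z^4 (coefficients listed from the highest power down).
roots = np.roots([-1.0, 0.0, 0.0, 0.0, 1.0])

print(twice_differenced[:5])
print(deseasonalized[:5])
print(np.abs(roots))
```

In this example the second difference of the quadratic trend is constant, the lag-4 difference removes the periodic pattern and turns the linear trend into a constant, and the roots of $1 - z^4$ all have modulus 1.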

7.32 Definition. A time series $X_t$ is an ARIMA(p, d, q) process if $\nabla^d X_t$ is a stationary
ARMA(p, q) process.

In other words, the time series $X_t$ is an ARIMA(p, d, q) process if there exist
polynomials $\phi$ and $\theta$ of degrees p and q and a white noise series $Z_t$ such that the time
series $\nabla^d X_t$ is stationary and $\phi(B)\nabla^d X_t = \theta(B)Z_t$ almost surely. The additional "I"
in ARIMA is for "integrated". If we view taking differences $\nabla^d$ as differentiating, then
the definition requires that a derivative of $X_t$ is a stationary ARMA process, whence $X_t$
itself is an "integrated ARMA process".

The following definition goes a step further.

7.33 Definition. A time series $X_t$ is a SARIMA(p, d, q)(P, D, Q, per) process if there
exist polynomials $\phi$, $\theta$, $\Phi$ and $\Theta$ of degrees p, q, P and Q and a white noise series
$Z_t$ such that the time series $\nabla_{per}^D \nabla^d X_t$ is stationary and $\Phi(B^{per})\phi(B)\nabla_{per}^D \nabla^d X_t =
\Theta(B^{per})\theta(B)Z_t$ almost surely.

The "S" in SARIMA is short for "seasonal". The idea of a seasonal model is that
we might only want to use certain powers $B^{per}$ of the backshift operator in our model,
because the series is thought to have a certain period. Including the terms $\Phi(B^{per})$ and
$\Theta(B^{per})$ does not make the model more general (as these terms could be subsumed in
$\phi(B)$ and $\theta(B)$), but reflects our a-priori idea that certain coefficients in the polynomials
are zero. This a-priori knowledge will be important when estimating the coefficients from
an observed time series.

Modelling an observed time series by an ARIMA, or SARIMA, model has become
popular through an influential book by Box and Jenkins. The unified filtering paradigm
of a "Box-Jenkins analysis" is indeed attractive. The popularity is probably also due to
the compelling manner in which Box and Jenkins explain to the reader how he or she must
set up the analysis, going through a fixed number of steps. They thus provide the
data-analyst with a clear algorithm to carry out an analysis that is intrinsically difficult. It is
obvious that the results of such an analysis will not always be good, but an alternative
is less obvious.

7.34 EXERCISE. Plot the spectral densities of the following time series:
(i) $X_t = Z_t + 0.9Z_{t-1}$;
(ii) $X_t = Z_t - 0.9Z_{t-1}$;
(iii) $X_t - 0.7X_{t-1} = Z_t$;
(iv) $X_t + 0.7X_{t-1} = Z_t$;
(v) $X_t - 1.5X_{t-1} + 0.9X_{t-2} - 0.2X_{t-3} + 0.1X_{t-9} = Z_t$.

7.35 EXERCISE. Simulate a series of length 200 according to the model $X_t - 1.3X_{t-1} +
0.7X_{t-2} = Z_t + 0.7Z_{t-1}$. Plot the sample auto-correlation and sample partial
auto-correlation functions.
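
As a possible starting point for numerical experiments of this kind, the sketch below evaluates the spectral density of a stationary ARMA process by the standard formula $f_X(\lambda) = \frac{\sigma^2}{2\pi} |\theta(e^{-i\lambda})|^2 / |\phi(e^{-i\lambda})|^2$ and simulates a path by the ARMA recursion. The function names, the burn-in length and the use of Gaussian noise are assumptions made for this illustration; plotting and the partial auto-correlations are left out.

```python
import numpy as np

def arma_spectral_density(lam, ar, ma, sigma2=1.0):
    """Spectral density of X_t = sum_j ar[j] X_{t-1-j} + Z_t + sum_j ma[j] Z_{t-1-j}."""
    z = np.exp(-1j * np.asarray(lam, dtype=float))
    phi = np.ones_like(z)
    theta = np.ones_like(z)
    for j, c in enumerate(ar):
        phi = phi - c * z ** (j + 1)         # phi(z) = 1 - phi_1 z - ... - phi_p z^p
    for j, c in enumerate(ma):
        theta = theta + c * z ** (j + 1)     # theta(z) = 1 + theta_1 z + ... + theta_q z^q
    return sigma2 / (2 * np.pi) * np.abs(theta) ** 2 / np.abs(phi) ** 2

def simulate_arma(n, ar, ma, rng, burn=500):
    """Simulate n values, discarding a burn-in so that the zero start-up values wear off."""
    p, q = len(ar), len(ma)
    m = max(p, q)
    total = n + burn + m
    z = rng.standard_normal(total)
    x = np.zeros(total)
    for t in range(m, total):
        x[t] = z[t]
        x[t] += sum(ar[j] * x[t - 1 - j] for j in range(p))
        x[t] += sum(ma[j] * z[t - 1 - j] for j in range(q))
    return x[-n:]

def sample_acf(x, max_lag):
    """Sample auto-correlations at lags 0, ..., max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    gamma0 = np.dot(x, x) / n
    return np.array([np.dot(x[h:], x[:n - h]) / n / gamma0 for h in range(max_lag + 1)])

lam = np.linspace(0.0, np.pi, 200)
f_ma = arma_spectral_density(lam, ar=[], ma=[0.9])      # a moving average of order 1
rng = np.random.default_rng(1)
x = simulate_arma(200, ar=[1.3, -0.7], ma=[0.7], rng=rng)
acf = sample_acf(x, max_lag=20)
print(acf[:5])
```

The burn-in is a practical device: by the stability results of Section 7.7 the effect of the arbitrary start-up values dies out for a causal model, so the retained stretch behaves approximately like the stationary solution.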



* 7.9 VARMA Processes 

A VARMA process is a vector-valued ARMA process. Given matrices $\Phi_j$ and $\Theta_j$ and a
white noise sequence $Z_t$ of dimension d, a VARMA(p, q) process satisfies the relationship

$$X_t = \Phi_1 X_{t-1} + \Phi_2 X_{t-2} + \cdots + \Phi_p X_{t-p} + Z_t + \Theta_1 Z_{t-1} + \cdots + \Theta_q Z_{t-q}.$$

The theory for VARMA processes closely resembles the theory for ARMA processes. The
role of the polynomials $\phi$ and $\theta$ is taken over by the matrix-valued polynomials

$$\Phi(z) = 1 - \Phi_1 z - \Phi_2 z^2 - \cdots - \Phi_p z^p,$$

$$\Theta(z) = 1 + \Theta_1 z + \Theta_2 z^2 + \cdots + \Theta_q z^q.$$

These identities and sums are to be interpreted entry-wise and hence define (d x d)-matrices
with entries that are polynomials in $z \in \mathbb{C}$.

Instead of looking at zeros of polynomials we must now look at the values of z for
which the matrices $\Phi(z)$ and $\Theta(z)$ are singular. Equivalently, we must look at the zeros
of the complex functions $z \mapsto \det \Phi(z)$ and $z \mapsto \det \Theta(z)$. Apart from this difference, the
conditions for existence of a stationary solution, causality and invertibility are the same.

7.36 Theorem. If the matrix-valued polynomial $\Phi(z)$ is invertible for every z on the
unit circle, then there exists a unique stationary solution $X_t$ to the VARMA equations.
If the matrix-valued polynomial $\Phi(z)$ is invertible for every z on the unit disc, then this
can be written in the form $X_t = \sum_{j=0}^\infty \Psi_j Z_{t-j}$ for matrices $\Psi_j$ with $\sum_{j=0}^\infty \|\Psi_j\| < \infty$.
If, moreover, the polynomial $\Theta(z)$ is invertible for every z on the unit disc, then we also
have that $Z_t = \sum_{j=0}^\infty \Pi_j X_{t-j}$ for matrices $\Pi_j$ with $\sum_{j=0}^\infty \|\Pi_j\| < \infty$.

The proof of this theorem is the same as the proofs of the corresponding results in 
the one-dimensional case, in view of the following observations. 

A series of the type $\sum_{j=-\infty}^\infty \Psi_j Z_{t-j}$ for matrices $\Psi_j$ with $\sum_j \|\Psi_j\| < \infty$ and a
process $Z_t$ with $\sup_t \mathrm{E}\|Z_t\| < \infty$ converges almost surely and in mean. Furthermore, the
analogue of Lemma 7.2 is true.

The functions $z \mapsto \det \Phi(z)$ and $z \mapsto \det \Theta(z)$ are polynomials. Hence if they are
nonzero on the unit circle, then they are nonzero on an open annulus containing the unit
circle, and the matrices $\Phi(z)$ and $\Theta(z)$ are invertible for every z in this annulus. Cramer's
rule, which expresses the solution of a system of linear equations in determinants, shows
that the entries of the inverse matrices $\Phi(z)^{-1}$ and $\Theta(z)^{-1}$ are quotients of polynomials.
The denominators are the determinants $\det \Phi(z)$ and $\det \Theta(z)$ and hence are nonzero in
a neighbourhood of the unit circle. These matrices may thus be expanded in Laurent
series

$$\Phi(z)^{-1} = \sum_{j=-\infty}^\infty \Psi_j z^j,$$

where the $\Psi_j$ are matrices such that $\sum_{j=-\infty}^\infty \|\Psi_j\| < \infty$, and similarly for $\Theta(z)^{-1}$.
The norm $\|\cdot\|$ in the preceding may be any matrix norm.
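
For a numerical check of the condition on $\det \Phi(z)$, one can use the standard fact that $\det \Phi(z)$ has no zeros on the unit disc exactly when all eigenvalues of the block companion matrix of $\Phi_1, \ldots, \Phi_p$ lie strictly inside the unit circle. The sketch below relies on this equivalence; the function names and the example matrices are assumptions made for the illustration.

```python
import numpy as np

def var_companion(phis):
    """Block companion matrix of Phi(z) = I - Phi_1 z - ... - Phi_p z^p."""
    p = len(phis)
    d = phis[0].shape[0]
    F = np.zeros((p * d, p * d))
    F[:d, :] = np.hstack(phis)               # first block row: Phi_1, ..., Phi_p
    F[d:, :-d] = np.eye((p - 1) * d)         # identity blocks on the block subdiagonal
    return F

def is_causal(phis):
    """True if det Phi(z) has no zeros with |z| <= 1 (eigenvalue criterion)."""
    eigenvalues = np.linalg.eigvals(var_companion(phis))
    return bool(np.max(np.abs(eigenvalues)) < 1)

Phi1 = np.array([[0.5, 0.2],
                 [0.1, 0.3]])
Phi1_bad = np.array([[1.1, 0.0],
                     [0.0, 0.2]])

eigs = np.linalg.eigvals(Phi1)
z0 = 1.0 / eigs[0]                            # a zero of z -> det(I - Phi1 z)
det_at_z0 = np.linalg.det(np.eye(2) - Phi1 * z0)

print(is_causal([Phi1]), is_causal([Phi1_bad]))
print(abs(z0), abs(det_at_z0))
```

For a VAR(1) model the zeros of $\det(I - \Phi_1 z)$ are the reciprocals of the nonzero eigenvalues of $\Phi_1$, which is why the example evaluates the determinant at $1/\lambda$ for an eigenvalue $\lambda$.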



8 

GARCH Processes 



White noise processes are basic building blocks for time series models, but can also 
be of interest on their own. A sequence of i.i.d. variables is an example of a white 
noise sequence, but is not of great interest as a time series. On the other hand, many 
financial time series are white noise series, but are not well-described by i.i.d. sequences. 
This is possible because the white noise property only concerns the second moments of 
the process, so that the variables of a white noise process may possess many types of 
dependence. GARCH processes are a class of white noise sequences that have been found 
useful for modelling certain financial time series. 

Figure 8.1 shows a realization of a GARCH process. The striking features are the
"bursts of activity", which alternate with quiet periods of the series. Here the frequency 
of the movements of the series is constant over time, but their amplitude changes, alter- 
nating between "volatile" periods (large amplitude) and quiet periods, so-called volatility 
clustering. A look at the auto-correlation function of the realization, Figure 8.2, shows 
that the alternations are not reflected in the second moments of the series: the series can 
be modelled as white noise, at least in the sense that the correlations are zero. 

Recall that a white noise series is any stationary time series whose auto-covariances
at nonzero lags vanish. We shall speak of a heteroscedastic white noise if the
auto-covariances at nonzero lags vanish, but the variances are time-dependent. A related
concept is that of a martingale difference series. Recall that a filtration $\mathcal{F}_t$ is a nondecreasing
collection of $\sigma$-fields $\cdots \subset \mathcal{F}_{-1} \subset \mathcal{F}_0 \subset \mathcal{F}_1 \subset \cdots$. A martingale difference series relative
to a filtration is a time series $X_t$ such that $X_t$ is $\mathcal{F}_t$-measurable and $\mathrm{E}(X_t | \mathcal{F}_{t-1}) = 0$
almost surely for every t. The latter includes the assumption that $\mathrm{E}|X_t| < \infty$, so that
the conditional expectation is well-defined.

Any martingale difference series $X_t$ with finite second moments is a (possibly
heteroscedastic) white noise series. Indeed, the equality $\mathrm{E}(X_t | \mathcal{F}_{t-1}) = 0$ is equivalent to
$X_t$ being orthogonal to all random variables $Y \in \mathcal{F}_{t-1}$, and this includes the variables
$X_s \in \mathcal{F}_s \subset \mathcal{F}_{t-1}$, for every $s < t$, so that $\mathrm{E}X_t X_s = 0$ for every $s < t$. Conversely, not
every white noise is a martingale difference series (relative to a natural filtration). This
is because $\mathrm{E}(X | Y) = 0$ implies that X is orthogonal to all measurable functions of Y,
not just orthogonal to linear functions.

Figure 8.1. Realization of length 500 of the stationary GARCH(1, 1) process with $\alpha = 0.15$, $\phi_1 = 0.4$, $\theta_1 = 0.4$
and standard normal variables $Z_t$.

8.1 EXERCISE. If $X_t$ is a martingale difference series, show that $\mathrm{E}(X_{t+k}X_{t+l} | \mathcal{F}_t) = 0$
almost surely for every $k \neq l > 0$. Thus "future variables are uncorrelated given the
present".

A martingale difference sequence has zero first moment given the past. A natural step 
for further modelling is to postulate a specific form of the conditional second moment. 
GARCH models are examples, and in that sense are again concerned only with first 
and second moments of the time series, albeit conditional moments. They turn out to 
capture many features of observed time series, in particular those in finance, that are
not captured by ARMA processes.



8.1 Linear GARCH 

There are many types of GARCH processes, of which we discuss a selection in the
following sections. Linear GARCH processes were the earliest GARCH processes to be
studied, and may be viewed as the GARCH processes.

Figure 8.2. Sample auto-covariance function of the time series in Figure 8.1.



8.2 Definition. A GARCH(p, q) process is a martingale difference sequence $X_t$ relative
to a given filtration $\mathcal{F}_t$ whose conditional variances $\sigma_t^2 = \mathrm{E}(X_t^2 | \mathcal{F}_{t-1})$ satisfy, for every
$t \in \mathbb{Z}$ and given constants $\alpha, \phi_1, \ldots, \phi_p, \theta_1, \ldots, \theta_q$,

(8.1) $\sigma_t^2 = \alpha + \phi_1 \sigma_{t-1}^2 + \cdots + \phi_p \sigma_{t-p}^2 + \theta_1 X_{t-1}^2 + \cdots + \theta_q X_{t-q}^2$, a.s.

With the usual convention that $\phi(z) = 1 - \phi_1 z - \cdots - \phi_p z^p$ and $\theta(z) = \theta_1 z + \cdots + \theta_q z^q$,
the equation for the conditional variance $\sigma_t^2 = \mathrm{var}(X_t | \mathcal{F}_{t-1})$ can be abbreviated to

$$\phi(B)\sigma_t^2 = \alpha + \theta(B)X_t^2.$$

Note that the polynomial $\theta$ is assumed to have zero intercept. If the coefficients $\phi_1, \ldots, \phi_p$
all vanish, then $\sigma_t^2$ is modelled as a linear function of $X_{t-1}^2, \ldots, X_{t-q}^2$. This is called an
ARCH(q) model, from "auto-regressive conditional heteroscedastic". The additional G
of GARCH is for the nondescript "generalized".

If $\sigma_t > 0$, then we can define $Z_t = X_t/\sigma_t$. The martingale difference property of
$X_t = \sigma_t Z_t$ and the definition of $\sigma_t^2$ as the conditional variance imply

(8.2) $\mathrm{E}(Z_t | \mathcal{F}_{t-1}) = 0, \qquad \mathrm{E}(Z_t^2 | \mathcal{F}_{t-1}) = 1.$

Conversely, given an adapted process $Z_t$ satisfying this display (a "scaled martingale
difference process") and a process $\sigma_t$ that is $\mathcal{F}_{t-1}$-measurable we can set $X_t = \sigma_t Z_t$.
Then $\sigma_t^2$ is the conditional variance of $X_t$ and the process $X_t$ is a GARCH process if (8.1)
is valid. It is then often added as an assumption that the variables $Z_t$ are i.i.d. and that
$Z_t$ is independent of $\mathcal{F}_{t-1}$. This is equivalent to assuming that the conditional law of the
variables $Z_t = X_t/\sigma_t$ given $\mathcal{F}_{t-1}$ is a given distribution, for instance a standard normal
distribution. In order to satisfy (8.2) this distribution must have a finite second moment,
but this is not strictly necessary for all of the following. The "conditional variances" in
Definition 8.2 may be understood in the general sense that does not require that the
variances $\mathrm{E}X_t^2$ are finite.

If we substitute $\sigma_t^2 = X_t^2 - W_t$ in (8.1), then we find after rearranging the terms,

(8.3) $X_t^2 = \alpha + (\phi_1 + \theta_1)X_{t-1}^2 + \cdots + (\phi_r + \theta_r)X_{t-r}^2 + W_t - \phi_1 W_{t-1} - \cdots - \phi_p W_{t-p},$

where $r = p \vee q$ and the sequences $\phi_1, \ldots, \phi_p$ or $\theta_1, \ldots, \theta_q$ are padded with zeros to
increase their lengths to r, if necessary. We can abbreviate this to

$$(\phi - \theta)(B)X_t^2 = \alpha + \phi(B)W_t, \qquad W_t = X_t^2 - \mathrm{E}(X_t^2 | \mathcal{F}_{t-1}).$$

This is the characterizing equation for an ARMA(r, r) process $X_t^2$ relative to the noise
process $W_t$. The variable $W_t = X_t^2 - \sigma_t^2$ is the prediction error when predicting $X_t^2$ by
its conditional expectation $\sigma_t^2 = \mathrm{E}(X_t^2 | \mathcal{F}_{t-1})$ and hence $W_t$ is orthogonal to $\mathcal{F}_{t-1}$. Thus
$W_t$ is a martingale difference series. The second moments of $W_t$ involve fourth order
moments of $X_t$ and do not necessarily exist under the assumptions we have made so far,
and if they exist, then they are not necessarily independent of t. It follows that $W_t$ is not
necessarily a white noise and $X_t^2$ is not an ARMA process in the sense of Definition 7.4.
A further warning against applying the results on ARMA processes unthinkingly to
the process $X_t^2$ is that $W_t$ is defined itself in terms of the process $X_t$ and therefore does
not have a simple interpretation as a noise process that drives the process $X_t^2$.

8.3 EXERCISE. Suppose that $X_t$ and $W_t$ are martingale difference series relative to a
given filtration such that $\phi(B)X_t^2 = \theta(B)W_t$ for polynomials $\phi$ and $\theta$ of degrees p and
q. Show that $X_t$ is a GARCH process. Does strict stationarity of the time series $X_t^2$ or
$W_t$ imply the strict stationarity of the time series $X_t$?

8.4 EXERCISE. Write $\sigma_t^2$ as an ARMA($p \vee q$, $q - 1$) model by substituting $X_t^2 = \sigma_t^2 + W_t$.

Alternatively, we can substitute $X_t = \sigma_t Z_t$ in the GARCH relation (8.1) and obtain

(8.4) $\sigma_t^2 = \alpha + (\phi_1 + \theta_1 Z_{t-1}^2)\sigma_{t-1}^2 + \cdots + (\phi_r + \theta_r Z_{t-r}^2)\sigma_{t-r}^2.$

This exhibits the process $\sigma_t^2$ as an auto-regressive process "with random coefficients
and deterministic innovations". This relation is useful to construct GARCH processes.
In the following theorem we consider given a martingale difference sequence $Z_t$ as in
(8.2), defined on a fixed probability space. Next we construct a GARCH process such
that $X_t = \sigma_t Z_t$ by first defining the process of squares $\sigma_t^2$ in terms of the $Z_t$. If the
coefficients $\alpha, \phi_j, \theta_j$ are nonnegative we obtain a stationary solution if the polynomial
$1 - \sum_{j=1}^r (\phi_j + \theta_j)z^j$ possesses no zeros on the unit disc. Under the condition that the
coefficients are nonnegative, the latter is equivalent to $\sum_j (\phi_j + \theta_j) < 1$.

8.5 EXERCISE. If $p_1, \ldots, p_r$ are nonnegative real numbers, then the polynomial $p(z) =
1 - \sum_{j=1}^r p_j z^j$ possesses no roots on the unit disc if and only if $p(1) > 0$. [Use that
$p(0) = 1 > 0$; furthermore, use the triangle inequality.]
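
The equivalence in Exercise 8.5 can be checked numerically on examples, which is of course no substitute for the proof. The sketch below compares the sign of $p(1)$ with the location of the roots computed by `numpy.roots`; the function name and the coefficient vectors are assumptions chosen for illustration.

```python
import numpy as np

def has_no_roots_on_unit_disc(p_coeffs):
    """For p(z) = 1 - sum_j p_j z^j, return True if every root satisfies |z| > 1."""
    p_coeffs = np.asarray(p_coeffs, dtype=float)
    # np.roots expects the coefficients from the highest power down to the constant term.
    poly = np.concatenate((-p_coeffs[::-1], [1.0]))
    roots = np.roots(poly)
    return bool(np.all(np.abs(roots) > 1))

examples = [
    [0.4, 0.4],         # p(1) = 0.2 > 0
    [0.0, 0.0, 0.9],    # p(1) = 0.1 > 0
    [0.6, 0.5],         # p(1) = -0.1 < 0
    [0.2, 0.3, 0.6],    # p(1) = -0.1 < 0
]
results = [(1.0 - sum(c) > 0, has_no_roots_on_unit_disc(c)) for c in examples]
print(results)
```

In each example the two criteria agree: when the nonnegative coefficients sum to less than 1 all roots lie outside the unit disc, and when they sum to more than 1 there is a real root between 0 and 1.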

8.6 Theorem. Let $\alpha > 0$, let $\phi_1, \ldots, \phi_p, \theta_1, \ldots, \theta_q$ be nonnegative numbers, and let $Z_t$
be a martingale difference sequence satisfying (8.2) relative to an arbitrary filtration $\mathcal{F}_t$.
(i) There exists a stationary GARCH process $X_t$ such that $X_t = \sigma_t Z_t$, where $\sigma_t^2 =
$\mathrm{E}(X_t^2 | \mathcal{F}_{t-1})$, if and only if $\sum_j (\phi_j + \theta_j) < 1$.
(ii) This process is unique among the GARCH processes $X_t$ with $X_t = \sigma_t Z_t$ that are
bounded in $L_2$.
(iii) This process satisfies $\sigma(X_t, X_{t-1}, \ldots) = \sigma(Z_t, Z_{t-1}, \ldots)$ for every t, and $\sigma_t^2 =
$\mathrm{E}(X_t^2 | \mathcal{F}_{t-1})$ is $\sigma(X_{t-1}, X_{t-2}, \ldots)$-measurable.

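Proof. Assume first that $\sum_j (\phi_j + \theta_j) < 1$. Furthermore, assume that there exists a
GARCH process $X_t$ that is bounded in $L_2$. Then the conditional variance $\sigma_t^2$ is bounded
in $L_1$ and satisfies, by (8.4),

$$\begin{pmatrix} \sigma_t^2 \\ \sigma_{t-1}^2 \\ \vdots \\ \sigma_{t-r+1}^2 \end{pmatrix}
=
\begin{pmatrix}
\phi_1 + \theta_1 Z_{t-1}^2 & \cdots & \phi_{r-1} + \theta_{r-1} Z_{t-r+1}^2 & \phi_r + \theta_r Z_{t-r}^2 \\
1 & \cdots & 0 & 0 \\
\vdots & \ddots & \vdots & \vdots \\
0 & \cdots & 1 & 0
\end{pmatrix}
\begin{pmatrix} \sigma_{t-1}^2 \\ \sigma_{t-2}^2 \\ \vdots \\ \sigma_{t-r}^2 \end{pmatrix}
+
\begin{pmatrix} \alpha \\ 0 \\ \vdots \\ 0 \end{pmatrix}.$$

Write this system as $Y_t = A_t Y_{t-1} + b$ and set $A = \mathrm{E}A_t$. With some effort it can be shown
that

$$\det(A - zI) = (-1)^r \Bigl(z^r - \sum_{j=1}^r (\phi_j + \theta_j) z^{r-j}\Bigr).$$

If $\sum_j (\phi_j + \theta_j) < 1$, then the polynomial on the right has all its roots inside the unit
circle. Equivalently, the spectral radius (the maximum of the moduli of the eigenvalues)
of the operator A is strictly smaller than 1. (See Exercise 8.5.) This implies that $\|A^n\|^{1/n}$
is bounded by a constant strictly smaller than 1 for all sufficiently large n and hence $\sum_{n=0}^\infty \|A^n\| < \infty$.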
Iterating the equation $Y_t = A_t Y_{t-1} + b$ we find that

(8.5) $Y_t = b + A_t b + A_t A_{t-1} b + \cdots + A_t A_{t-1} \cdots A_{t-n+1} b + A_t A_{t-1} \cdots A_{t-n} Y_{t-n-1}.$

Because $Z_t = X_t/\sigma_t$ is $\mathcal{F}_t$-measurable and $\mathrm{E}(Z_t^2 | \mathcal{F}_{t-1}) = 1$ for every t, we have that
$\mathrm{E}Z_{t_1}^2 \cdots Z_{t_k}^2 = 1$, for every $t_1 < t_2 < \cdots < t_k$. By some matrix algebra it can be seen
that this implies that

$$\mathrm{E}A_t A_{t-1} \cdots A_{t-n} = A^{n+1} \to 0, \qquad n \to \infty.$$

Because the matrices $A_t$ possess nonnegative entries, this implies that the sequence
$A_t A_{t-1} \cdots A_{t-n}$ converges to zero in probability. If the process $X_t$ is bounded in $L_2$,
then, in view of its definition, the process $Y_t$ is bounded in $L_1$. We conclude that
$A_t A_{t-1} \cdots A_{t-n} Y_{t-n-1} \to 0$ in probability as $n \to \infty$. Combining this with the
expression for $Y_t$ in (8.5), we see that

(8.6) $Y_t = b + \sum_{j=1}^\infty A_t A_{t-1} \cdots A_{t-j+1} b.$

This implies that $\mathrm{E}Y_t = \sum_{j=0}^\infty A^j b$, whence $\mathrm{E}Y_t$ and hence $\mathrm{E}X_t^2$ are independent of t.

Because the matrices $A_t$ are measurable functions of $(Z_{t-1}, Z_{t-2}, \ldots)$, the variable
$Y_t$ is a measurable transformation of these variables as well, and hence the variable
$X_t = \sigma_t Z_t$ is a measurable transformation of $(Z_t, Z_{t-1}, \ldots)$.

The process $W_t = X_t^2 - \sigma_t^2$ is bounded in $L_1$ and satisfies the ARMA relation
$(\phi - \theta)(B)X_t^2 = \alpha + \phi(B)W_t$ as in (8.3). Because $\phi$ has no roots on the unit disc, this relation
is invertible, whence $W_t = (1/\phi)(B)\bigl((\phi - \theta)(B)X_t^2 - \alpha\bigr)$ is a measurable transformation of
$X_t^2, X_{t-1}^2, \ldots$. We conclude that $\sigma_t^2 = X_t^2 - W_t$ and hence $Z_t = X_t/\sigma_t$ are
$\sigma(X_t, X_{t-1}, \ldots)$-measurable. Since $\sigma_t^2$ is $\sigma(Z_{t-1}, Z_{t-2}, \ldots)$-measurable by the preceding paragraph, it
follows that it is $\sigma(X_{t-1}, X_{t-2}, \ldots)$-measurable.

We have proved that any GARCH process $X_t$ that is bounded in $L_2$ defines a
conditional variance process $\sigma_t^2$ and corresponding process $Y_t$ that satisfies (8.6). Furthermore,
we have proved (iii) for this process.

We next construct a GARCH process $X_t$ by reversing the definitions, still assuming
that $\sum_j (\phi_j + \theta_j) < 1$. We define matrices $A_t$ in terms of the process $Z_t$ as before. The
series on the right of (8.6) converges in $L_1$ and hence defines a process $Y_t$. Simple algebra
shows that this satisfies $Y_t = A_t Y_{t-1} + b$ for every t. All coordinates of $Y_t$ are nonnegative
and $\sigma(Z_{t-1}, Z_{t-2}, \ldots)$-measurable.

Given the processes $(Z_t, Y_t)$ we define processes $(X_t, \sigma_t)$ by

$$\sigma_t = \sqrt{Y_{t,1}}, \qquad X_t = \sigma_t Z_t.$$

Because $\sigma_t$ is $\sigma(Z_{t-1}, Z_{t-2}, \ldots) \subset \mathcal{F}_{t-1}$-measurable, we have that $\mathrm{E}(X_t | \mathcal{F}_{t-1}) =
\sigma_t \mathrm{E}(Z_t | \mathcal{F}_{t-1}) = 0$ and $\mathrm{E}(X_t^2 | \mathcal{F}_{t-1}) = \sigma_t^2 \mathrm{E}(Z_t^2 | \mathcal{F}_{t-1}) = \sigma_t^2$. That $\sigma_t^2$ satisfies (8.1) is
a consequence of the relations $Y_t = A_t Y_{t-1} + b$, whose first line expresses $\sigma_t^2$ into $\sigma_{t-1}^2$
and $Y_{t-1,2}, \ldots, Y_{t-1,r}$, and whose other lines permit to reexpress the $Y_{t-1,k}$ as $\sigma_{t-k}^2$ by
recursive use of the relations $Y_{t,k} = Y_{t-1,k-1}$, and the definitions $Y_{t,1} = \sigma_t^2$.

This concludes the proof that there exists a stationary solution as soon as $\sum_j (\phi_j +
\theta_j) < 1$. Finally, we show that this inequality is necessary. If $X_t$ is a stationary solution,
then $Y_t$ in (8.5) is integrable. Taking the expectation of left and right of this equation
for $t = 0$ and remembering that all terms are nonnegative, we see that $\sum_{j=0}^n A^j b \le \mathrm{E}Y_0$,
for every n. This implies that $A^n b \to 0$ as $n \to \infty$, or, equivalently $A^n e_1 \to 0$, where $e_i$
is the ith unit vector. In view of the definition of A we see, recursively, that

$$A^n e_r = A^{n-1}(\phi_r + \theta_r)e_1 \to 0,$$
$$A^n e_{r-1} = A^{n-1}\bigl((\phi_{r-1} + \theta_{r-1})e_1 + e_r\bigr) \to 0,$$
$$\vdots$$
$$A^n e_2 = A^{n-1}\bigl((\phi_2 + \theta_2)e_1 + e_3\bigr) \to 0.$$

Therefore, the sequence $A^n$ converges to zero. This can only happen if none of the
eigenvalues of A is on or outside the unit circle. Equivalently, if the polynomial $1 - \sum_{j=1}^r (\phi_j + \theta_j)z^j$
possesses no roots on or inside the unit disc. ■

Volatility clustering is one of the stylized facts of financial time series, and it is
captured by GARCH processes: large absolute values of a GARCH series at times
$t - 1, \ldots, t - q$ lead, through the GARCH equation, to a large conditional variance $\sigma_t^2$ at
time t, and hence the value $X_t = \sigma_t Z_t$ of the time series at time t tends to be large.
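
The mechanism can be made concrete by simulating a GARCH(1, 1) path directly from the recursion $\sigma_t^2 = \alpha + \phi\sigma_{t-1}^2 + \theta X_{t-1}^2$, $X_t = \sigma_t Z_t$. The sketch below uses the parameter values quoted in the caption of Figure 8.1 with standard normal $Z_t$; the function name, the burn-in length, the seed and the choice to start at the stationary mean of $\sigma_t^2$ are assumptions made for this illustration, and the simulated path is not the one shown in the figure.

```python
import numpy as np

def simulate_garch11(n, alpha, phi, theta, rng, burn=1000):
    """Simulate X_t = sigma_t Z_t with sigma_t^2 = alpha + phi sigma_{t-1}^2 + theta X_{t-1}^2."""
    total = n + burn
    z = rng.standard_normal(total)
    sigma2 = np.empty(total)
    x = np.empty(total)
    sigma2[0] = alpha / (1.0 - phi - theta)      # start at the stationary mean (needs phi + theta < 1)
    x[0] = np.sqrt(sigma2[0]) * z[0]
    for t in range(1, total):
        sigma2[t] = alpha + phi * sigma2[t - 1] + theta * x[t - 1] ** 2
        x[t] = np.sqrt(sigma2[t]) * z[t]
    return x[burn:], sigma2[burn:]               # discard the burn-in stretch

def lag1_autocorrelation(y):
    """Sample auto-correlation at lag 1."""
    y = np.asarray(y, dtype=float) - np.mean(y)
    return float(np.dot(y[1:], y[:-1]) / np.dot(y, y))

alpha, phi, theta = 0.15, 0.4, 0.4
rng = np.random.default_rng(2)
x, sigma2 = simulate_garch11(500, alpha, phi, theta, rng)

print(lag1_autocorrelation(x))        # auto-correlation of the series itself
print(lag1_autocorrelation(x ** 2))   # auto-correlation of the squares
```

Comparing the two printed values for different seeds gives a feeling for the difference between the series, which is a white noise, and its squares, which carry the dependence.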

A second stylized fact is the leptokurtic tails of the marginal distribution of a
typical financial time series. A distribution on $\mathbb{R}$ is called leptokurtic if it has fat tails,
for instance fatter than normal tails. A quantitative measure of "fatness" of the tails
of the distribution of a random variable is the kurtosis defined as $\kappa_4(X) = \mathrm{E}(X -
\mathrm{E}X)^4/(\mathrm{var}\, X)^2$. It is equal to 3 for a normally distributed variable. If $X_t = \sigma_t Z_t$, where
$\sigma_t$ is $\mathcal{F}_{t-1}$-measurable and $Z_t$ is independent of $\mathcal{F}_{t-1}$ with mean zero and variance 1,
then

$$\mathrm{E}X_t^4 = \mathrm{E}\sigma_t^4\, \mathrm{E}Z_t^4 = \kappa_4(Z_t)\,\mathrm{E}\bigl(\mathrm{E}(X_t^2 | \mathcal{F}_{t-1})\bigr)^2 \ge \kappa_4(Z_t)\bigl(\mathrm{E}\mathrm{E}(X_t^2 | \mathcal{F}_{t-1})\bigr)^2 = \kappa_4(Z_t)(\mathrm{E}X_t^2)^2.$$

Dividing the left and right sides by $(\mathrm{E}X_t^2)^2$, we see that $\kappa_4(X_t) \ge \kappa_4(Z_t)$. The difference
can be substantial if the variance of the random variable $\mathrm{E}(X_t^2 | \mathcal{F}_{t-1})$ is large. In fact,
taking the difference of the left and right sides of the preceding display yields

$$\kappa_4(X_t) = \kappa_4(Z_t)\Bigl(1 + \frac{\mathrm{var}\, \mathrm{E}(X_t^2 | \mathcal{F}_{t-1})}{(\mathrm{E}X_t^2)^2}\Bigr).$$

It follows that the GARCH structure is also able to capture some of the observed
leptokurtosis of financial time series.

If we use a Gaussian process $Z_t$, then the kurtosis of the observed series $X_t$ is always
bigger than 3. It has been observed that this usually does not go far enough in explaining
"excess kurtosis" over the normal distribution. The use of one of Student's t-distributions
can often improve the fit of a GARCH process substantially.

A third stylized fact observed in financial time series is the presence of positive auto-correlations
for the sequence of squares $X_t^2$. The auto-correlation function of the squares of a GARCH
series will exist under appropriate additional conditions on the coefficients and the driving
noise process $Z_t$. The ARMA relation (8.3) for the square process $X_t^2$ may be used to
compute this function, using formulas for the auto-correlation function of an ARMA
process. Here we must not forget that the process $W_t$ in (8.3) is defined through $X_t$ and
hence its variance depends on the parameters in the GARCH relation.

8.7 Example (GARCH(1, 1)). The conditional variances of a GARCH(1, 1) process
satisfy $\sigma_t^2 = \alpha + \phi\sigma_{t-1}^2 + \theta X_{t-1}^2$. If we assume the process $X_t$ to be stationary, then
$\mathrm{E}\sigma_t^2 = \mathrm{E}X_t^2$ is independent of t. Taking the expectation across the GARCH equation
and rearranging then immediately gives

$$\mathrm{E}\sigma_t^2 = \mathrm{E}X_t^2 = \frac{\alpha}{1 - \phi - \theta}.$$

To compute the auto-correlation function of the time series of squares $X_t^2$, we employ
(8.3), which reveals this process as an ARMA(1, 1) process with the auto-regressive and
moving average polynomials given as $1 - (\phi + \theta)z$ and $1 - \phi z$, respectively. The calculations
in Example 7.25 yield that

$$\gamma_{X^2}(h) = \tau^2 (\phi + \theta)^h \frac{\bigl(1 - \phi(\phi + \theta)\bigr)\bigl(1 - \phi/(\phi + \theta)\bigr)}{1 - (\phi + \theta)^2}, \qquad h > 0,$$

$$\gamma_{X^2}(0) = \tau^2 \Bigl(\frac{\bigl(1 - \phi(\phi + \theta)\bigr)\bigl(1 - \phi/(\phi + \theta)\bigr)}{1 - (\phi + \theta)^2} + \frac{\phi}{\phi + \theta}\Bigr).$$

Here $\tau^2$ is the variance of the process $W_t = X_t^2 - \mathrm{E}(X_t^2 | \mathcal{F}_{t-1})$, which is also dependent
on the parameters $\theta$ and $\phi$. By squaring the GARCH equation we find

$$\sigma_t^4 = \alpha^2 + \phi^2\sigma_{t-1}^4 + \theta^2 X_{t-1}^4 + 2\alpha\phi\sigma_{t-1}^2 + 2\alpha\theta X_{t-1}^2 + 2\phi\theta\sigma_{t-1}^2 X_{t-1}^2.$$

If $Z_t$ is independent of $\mathcal{F}_{t-1}$, then $\mathrm{E}\sigma_t^2 X_t^2 = \mathrm{E}\sigma_t^4$ and $\mathrm{E}X_t^4 = \kappa_4(Z_t)\mathrm{E}\sigma_t^4$. If we assume,
moreover, that the moments exist and are independent of t, then we can take the
expectation across the preceding display and rearrange to find that

$$\mathrm{E}\sigma_t^4\bigl(1 - \phi^2 - 2\phi\theta - \kappa_4(Z)\theta^2\bigr) = \alpha^2 + (2\alpha\phi + 2\alpha\theta)\mathrm{E}\sigma_t^2.$$

Together with the formulas obtained previously, this gives the variance of $W_t = X_t^2 -
\mathrm{E}(X_t^2 | \mathcal{F}_{t-1})$, since $\mathrm{E}W_t = 0$ and $\mathrm{E}W_t^2 = \mathrm{E}X_t^4 - \mathrm{E}\sigma_t^4$, by the Pythagorean identity for
projections. □

8.8 EXERCISE. Find the auto-covariance function of the process $\sigma_t^2$ for a GARCH(1, 1)
process.

8.9 EXERCISE. Find an expression for the kurtosis of the marginal distribution in a
stationary GARCH(1, 1) process as in the preceding example. Can this be made arbitrarily
large?

The condition that $\sum_j (\phi_j + \theta_j) < 1$ is necessary for existence of a GARCH process
with bounded second moments, but stronger than necessary if we are interested in a
strictly stationary solution to the GARCH equations, and do not require the existence of
moments. We can see this from the proof of the preceding theorem, where the GARCH
process is defined from the series in (8.6). If this series converges in an almost sure
sense, then a strictly stationary GARCH process exists. The series involves products
of random matrices $A_t$; its convergence depends on the value of their top Lyapounov
exponent, defined by

$$\gamma = \inf_{n \in \mathbb{N}} \frac{1}{n} \mathrm{E} \log \|A_{-1}A_{-2} \cdots A_{-n}\|.$$

Here $\|\cdot\|$ may be any matrix norm (all matrix norms being equivalent). If the process
$Z_t$ is ergodic, for instance i.i.d., then we can apply Kingman's subadditive ergodic theorem (e.g.
Dudley (1987, Theorem 10.7.1)) to the process $\log \|A_{-1}A_{-2} \cdots A_{-n}\|$ to see that

$$\frac{1}{n} \log \|A_{-1}A_{-2} \cdots A_{-n}\| \to \gamma, \qquad \text{a.s.}$$

This implies that the sequence of matrices $A_{-1}A_{-2} \cdots A_{-n}$ converges to zero almost
surely as soon as $\gamma < 0$. The convergence is then exponentially fast and the series in
(8.6) will converge.

Thus sufficient conditions for the existence of strictly stationary solutions to the
GARCH equations can be given in terms of the top Lyapounov exponent of the random
matrices $A_t$. This exponent is in general difficult to compute explicitly, but it can easily
be estimated numerically for a given sequence $Z_t$.

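To obtain conditions that are both sufficient and necessary the preceding proof must
be adapted somewhat. The following theorem is in terms of the top Lyapounov exponent
of the matrices

$$A_t = \begin{pmatrix}
\phi_1 + \theta_1 Z_{t-1}^2 & \phi_2 & \cdots & \phi_{p-1} & \phi_p & \theta_2 & \cdots & \theta_{q-1} & \theta_q \\
1 & 0 & \cdots & 0 & 0 & 0 & \cdots & 0 & 0 \\
0 & 1 & \cdots & 0 & 0 & 0 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots & & \vdots & \vdots \\
0 & 0 & \cdots & 1 & 0 & 0 & \cdots & 0 & 0 \\
Z_{t-1}^2 & 0 & \cdots & 0 & 0 & 0 & \cdots & 0 & 0 \\
0 & 0 & \cdots & 0 & 0 & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & 0 & 0 & 0 & \cdots & 1 & 0
\end{pmatrix}. \tag{8.7}$$

These matrices have the advantage of being independent and identically distributed if
the process $Z_t$ is i.i.d. They are motivated by the equation obtained by substituting
$X_{t-1} = \sigma_{t-1}Z_{t-1}$ in the GARCH equation (8.1), leaving $X_{t-2}, \ldots, X_{t-q}$ untouched:

$$\sigma_t^2 = \alpha + (\phi_1 + \theta_1 Z_{t-1}^2)\sigma_{t-1}^2 + \phi_2\sigma_{t-2}^2 + \cdots + \phi_p\sigma_{t-p}^2 + \theta_2 X_{t-2}^2 + \cdots + \theta_q X_{t-q}^2.$$

This equation gives rise to the system of equations $Y_t = A_tY_{t-1} + b$ for the random
vectors $Y_t = (\sigma_t^2, \ldots, \sigma_{t-p+1}^2, X_{t-1}^2, \ldots, X_{t-q+1}^2)^T$ and the vector $b$ equal to $\alpha$ times the
first unit vector.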



8.10 Theorem. Let $\alpha > 0$, let $\phi_1, \ldots, \phi_p, \theta_1, \ldots, \theta_q$ be nonnegative numbers, and let
$Z_t$ be an i.i.d. sequence with mean zero and unit variance. There exists a strictly
stationary GARCH process $X_t$ such that $X_t = \sigma_tZ_t$, where $\sigma_t^2 = \mathrm{E}(X_t^2 \mid \mathcal{F}_{t-1})$ and
$\mathcal{F}_t = \sigma(Z_t, Z_{t-1}, \ldots)$, if and only if the top Lyapounov exponent of the random matrices
$A_t$ given by (8.7) is strictly negative. For this process $\sigma(X_t, X_{t-1}, \ldots) = \sigma(Z_t, Z_{t-1}, \ldots)$.



Proof. Let $b = \alpha e_1$, where $e_i$ is the $i$th unit vector in $\mathbb{R}^{p+q-1}$. If $\gamma'$ is strictly larger than
the top Lyapounov exponent $\gamma$, then $\|A_tA_{t-1}\cdots A_{t-n+1}\| < e^{\gamma' n}$, eventually as $n \to \infty$,
almost surely, and hence, eventually,

$$\|A_tA_{t-1}\cdots A_{t-n+1}b\| < e^{\gamma' n}\|b\|.$$

If $\gamma < 0$, then we may choose $\gamma' < 0$, and hence $\sum_n \|A_tA_{t-1}\cdots A_{t-n+1}b\| < \infty$ almost
surely. Then the series on the right side of (8.6), but with the matrix $A_t$ defined as in
(8.7), converges almost surely and defines a process $Y_t$. We can then define processes by
setting $\sigma_t = \sqrt{Y_{t,1}}$ and $X_t = \sigma_tZ_t$. That these processes satisfy the GARCH relation
follows from the relations $Y_t = A_tY_{t-1} + b$, as in the proof of Theorem 8.6. Being a fixed
measurable transformation of $(Z_t, Z_{t-1}, \ldots)$ for each $t$, the process $(\sigma_t, X_t)$ is strictly
stationary.

By construction the variable $X_t$ is $\sigma(Z_t, Z_{t-1}, \ldots)$-measurable for every $t$. To see
that, conversely, $Z_t$ is $\sigma(X_t, X_{t-1}, \ldots)$-measurable, we apply a similar argument as in
the proof of Theorem 8.6, based on inverting the relation $(\phi - \theta)(B)X_t^2 = \alpha + \phi(B)W_t$,
for $W_t = X_t^2 - \sigma_t^2$. Presently, the series $X_t^2$ and $W_t$ are not necessarily integrable,
but Lemma 8.11 below still allows us to conclude that $W_t$ is $\sigma(X_t^2, X_{t-1}^2, \ldots)$-measurable,
provided that the polynomial $\phi$ has no zeros on the unit disc.

The matrix $B$ obtained by replacing the variables $Z_{t-1}$ and the numbers $\theta_j$ in the
matrix $A_t$ by zero is bounded above by $A_t$ in a coordinatewise sense. By the nonnegativity
of the entries this implies that $B^n \le A_0A_{-1}\cdots A_{-n+1}$ and hence $B^n \to 0$. This can
happen only if all eigenvalues of $B$ are inside the unit circle. Indeed, if $\lambda$ is an eigenvalue
of $B$ with $|\lambda| \ge 1$ and $c \ne 0$ a corresponding eigenvector, then $B^nc = \lambda^nc$ does not
converge to zero. Now

$$\det(B - zI) = (-1)^{p+q-1}\Bigl(z^{p+q-1} - \sum_{j=1}^{p}\phi_j z^{p+q-1-j}\Bigr).$$

Thus $z$ is a zero of $\phi$ if and only if $z^{-1}$ is an eigenvalue of $B$. We conclude that $\phi$ has no
zeros on the unit disc.

Finally, we show the necessity of the top Lyapounov exponent being negative. If
there exists a strictly stationary solution to the GARCH equations, then, by (8.5) and the
nonnegativity of the coefficients, $\sum_{j=1}^{n} A_0A_{-1}\cdots A_{-j+1}b \le Y_0$ for every $n$, and hence
$A_0A_{-1}\cdots A_{-n+1}b \to 0$ as $n \to \infty$, almost surely. By the form of $b$ this is equivalent
to $A_0A_{-1}\cdots A_{-n+1}e_1 \to 0$. Using the structure of the matrices $A_t$ we next see that
$A_0A_{-1}\cdots A_{-n+1} \to 0$ in probability as $n \to \infty$, by an argument similar to that in the
proof of Theorem 8.6. Because the matrices $A_t$ are independent and the event where
$A_0A_{-1}\cdots A_{-n+1} \to 0$ is a tail event, this event must have probability one. It can be
shown that this is possible only if the top Lyapounov exponent of the matrices $A_t$ is
negative.† ■

8.11 Lemma. Let $\phi$ be a polynomial without roots on the unit disc and let $X_t$ be
a time series that is bounded in probability. If $Z_t = \phi(B)X_t$ for every $t$, then $X_t$ is
$\sigma(Z_t, Z_{t-1}, \ldots)$-measurable.

Proof. Because $\phi(0) \ne 0$ by assumption, we can assume without loss of generality
that $\phi$ possesses intercept 1. If $\phi$ is of degree 0, then $X_t = Z_t$ for every $t$ and the
assertion is certainly true. We next proceed by induction on the degree of $\phi$. If $\phi$ is of
degree $p \ge 1$, then we can write it as $\phi(z) = (1 - \phi z)\phi_{p-1}(z)$ for a polynomial $\phi_{p-1}$
of degree $p - 1$ and a complex number $\phi$ with $|\phi| < 1$. The series $Y_t = (1 - \phi B)X_t$ is
bounded in probability and $\phi_{p-1}(B)Y_t = Z_t$, whence $Y_t$ is $\sigma(Z_t, Z_{t-1}, \ldots)$-measurable,
by the induction hypothesis. By iterating the relation $X_t = \phi X_{t-1} + Y_t$, we find that
$X_t = \phi^nX_{t-n} + \sum_{j=0}^{n-1}\phi^jY_{t-j}$. Because the sequence $X_t$ is uniformly tight and $\phi^n \to 0$,
the sequence $\phi^nX_{t-n}$ converges to zero in probability. Hence $X_t$ is the limit in probability
of a sequence that is $\sigma(Y_t, Y_{t-1}, \ldots)$-measurable and hence is $\sigma(Z_t, Z_{t-1}, \ldots)$-measurable.
This implies the result. ■

† See Bougerol (), Lemma

8.12 EXERCISE. In the preceding lemma the function $\psi(z) = 1/\phi(z)$ possesses a power
series representation $\psi(z) = \sum_{j=0}^{\infty}\psi_jz^j$ on a neighbourhood of the unit disc. Is it true
under the conditions of the lemma that $X_t = \sum_{j=0}^{\infty}\psi_jZ_{t-j}$, where the series converges
(at least) in probability?

8.13 Example. For the GARCH(1, 1) process the random matrices $A_t$ given by (8.7)
reduce to the random variables $\phi_1 + \theta_1Z_{t-1}^2$. The top Lyapounov exponent of these
random $1\times 1$ matrices is equal to $\mathrm{E}\log(\phi_1 + \theta_1Z_t^2)$. This number can be written as
an integral relative to the distribution of $Z_t$, but in general is not easy to compute
analytically. □
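As a hedged illustration of Example 8.13, the exponent $\mathrm{E}\log(\phi_1 + \theta_1Z_t^2)$ can be approximated by a plain Monte Carlo average. The sketch below assumes standard normal variables $Z_t$ (an assumption made here; the example only requires mean zero and unit variance). The function name `lyapunov_garch11` and the parameter values are hypothetical choices for illustration.

```python
import math
import random


def lyapunov_garch11(phi1, theta1, n_draws=400_000, seed=0):
    """Monte Carlo estimate of E log(phi1 + theta1 * Z^2) for standard normal Z."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        z = rng.gauss(0.0, 1.0)
        total += math.log(phi1 + theta1 * z * z)
    return total / n_draws


# ARCH(1)-type choice with phi1 + theta1 = 2 > 1 (illustrative values).
gamma_hat = lyapunov_garch11(phi1=0.0, theta1=2.0)
print(gamma_hat)
```

For $\phi_1 = 0$, $\theta_1 = 2$ and Gaussian noise the exact value is minus Euler's constant, roughly $-0.577$. The exponent is therefore negative although $\phi_1 + \theta_1 > 1$, which illustrates that strict stationarity requires less than the condition for bounded second moments.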

The proofs of the preceding theorems provide a recipe for generating a GARCH
process starting from initial values. Given a centered and scaled i.i.d. process $Z_t$ and an
arbitrary random vector $Y_0$ of dimension $p + q - 1$, we define a process $Y_t$ through the
recursions $Y_t = A_tY_{t-1} + b$, with the matrices $A_t$ given in (8.7) and the vector $b$ equal to
$\alpha$ times the first unit vector. Next we set $\sigma_t = \sqrt{Y_{t,1}}$ and $X_t = \sigma_tZ_t$ for $t \ge 1$. Because
the stationary solution to the GARCH equation is unique, the initial vector $Y_0$ must
be simulated from a "stationary distribution" in order to obtain a stationary GARCH
process. However, the effect of a "nonstationary" initialization wears off as $t \to \infty$ and
the process will approach stationarity, provided the coefficients in the GARCH equation
are such that a stationary solution exists. This is true both for $L_2$-stationarity and strict
stationarity, under the appropriate conditions on the coefficients.
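The recipe can be sketched for the GARCH(1, 1) case, where $Y_t$ reduces to $\sigma_t^2$. The code below is a minimal illustration under stated assumptions: Gaussian noise, an arbitrary starting value `sigma2_init`, and a burn-in period after which the nonstationary initialization is treated as forgotten. The function name, the burn-in length and the parameter values are hypothetical.

```python
import math
import random


def simulate_garch11(alpha, phi1, theta1, n, burn_in=1000, sigma2_init=1.0, seed=0):
    """Return n values X_t = sigma_t * Z_t, discarding the first burn_in steps."""
    rng = random.Random(seed)
    sigma2 = sigma2_init          # arbitrary ("nonstationary") starting value
    xs = []
    for t in range(burn_in + n):
        z = rng.gauss(0.0, 1.0)
        x = math.sqrt(sigma2) * z
        if t >= burn_in:
            xs.append(x)
        # GARCH(1,1) recursion for the next conditional variance
        sigma2 = alpha + phi1 * sigma2 + theta1 * x * x
    return xs


xs = simulate_garch11(alpha=0.1, phi1=0.8, theta1=0.1, n=100_000)
sample_var = sum(x * x for x in xs) / len(xs)
print(sample_var)
```

With these values $\phi_1 + \theta_1 = 0.9 < 1$, so the stationary variance is $\alpha/(1 - \phi_1 - \theta_1) = 1$, and the sample second moment should be close to this number.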

8.14 Theorem. Let $\alpha > 0$, let $\phi_1, \ldots, \phi_p, \theta_1, \ldots, \theta_q$ be nonnegative numbers, and let
$Z_t$ be an i.i.d. process with mean zero and unit variance such that $Z_t$ is independent of
$\mathcal{F}_{t-1}$ for every $t \in \mathbb{Z}$.

(i) If $\sum_j(\phi_j + \theta_j) < 1$, then the difference $X_t - \tilde X_t$ of any two solutions $X_t = \sigma_tZ_t$ and
$\tilde X_t = \tilde\sigma_tZ_t$ of the GARCH equations that are square-integrable converges to zero in
$L_2$ as $t \to \infty$.
(ii) If the top Lyapounov exponent of the matrices $A_t$ in (8.7) is negative, then the
difference $X_t - \tilde X_t$ of any two solutions $X_t = \sigma_tZ_t$ and $\tilde X_t = \tilde\sigma_tZ_t$ of the GARCH
equations converges to zero in probability as $t \to \infty$.

Proof. From the two given GARCH processes $X_t$ and $\tilde X_t$ define processes $Y_t$ and $\tilde Y_t$ as
indicated preceding the statement of Theorem 8.10. These processes satisfy (8.5) for the
matrices $A_t$ given in (8.7). Choosing $n = t - 1$ and taking differences we see that

$$Y_t - \tilde Y_t = A_tA_{t-1}\cdots A_1(Y_0 - \tilde Y_0).$$

If the top Lyapounov exponent of the matrices $A_t$ is negative, then the norm of the
right side can be bounded, almost surely for sufficiently large $t$, by $e^{\gamma' t}\|Y_0 - \tilde Y_0\|$ for
some number $\gamma' < 0$. This follows from the subergodic theorem, as before (even though
this time the matrix product grows on its left side). This converges to zero as $t \to \infty$,
implying that $\sigma_t - \tilde\sigma_t \to 0$ almost surely as $t \to \infty$. This in turn implies (ii).

Under the condition of (i), the spectral radius of the matrix $A = \mathrm{E}A_t$ is strictly
smaller than 1 and hence $\|A^n\| \to 0$. By the nonnegativity of the entries of the matrices
$A_t$ the absolute values of the coordinates of the vectors $Y_t - \tilde Y_t$ are bounded above by the
coordinates of the vector $A_tA_{t-1}\cdots A_1Z_0$, for $Z_0$ the vector obtained by replacing the
coordinates of $Y_0 - \tilde Y_0$ by their absolute values. By the independence of the matrices $A_t$
and vector $Z_0$, the expectation of $A_tA_{t-1}\cdots A_1Z_0$ is bounded by $A^t\mathrm{E}Z_0$, which converges
to zero. This implies that, as $t \to \infty$,

$$\mathrm{E}|X_t^2 - \tilde X_t^2| = \mathrm{E}|\sigma_t^2 - \tilde\sigma_t^2|Z_t^2 = \mathrm{E}|\sigma_t^2 - \tilde\sigma_t^2| \to 0.$$

For the stationary solution $X_t$ the sequence $(X_t^2)$ is uniformly integrable. By the preceding
display this is then also true for $\tilde X_t$, and hence also for a general $\tilde X_t$. The sequence
$X_t - \tilde X_t$ is then uniformly square integrable as well. Combining this with the fact that
$X_t - \tilde X_t \to 0$ in probability, we see that $X_t - \tilde X_t$ converges to zero in second mean. ■
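The coupling in the first part of the proof can be observed numerically in the GARCH(1, 1) case, where the difference of the two conditional variances equals a product of the scalars $\phi_1 + \theta_1Z_s^2$ times the initial difference. The sketch below is an illustration only: it assumes Gaussian noise, and the function name, the starting values and the parameters are hypothetical. The parameters are chosen with $\phi_1 + \theta_1 > 1$ but with a negative Lyapounov exponent ($\mathrm{E}\log(2Z^2) \approx -0.577$ for standard normal $Z$).

```python
import random


def coupled_sigma2(alpha, phi1, theta1, s0, s0_tilde, n, seed=1):
    """Run two GARCH(1,1) variance recursions driven by the same noise Z_t."""
    rng = random.Random(seed)
    s, s_tilde = s0, s0_tilde
    path, path_tilde = [], []
    for _ in range(n):
        z2 = rng.gauss(0.0, 1.0) ** 2
        # sigma_t^2 = alpha + (phi1 + theta1 * Z_{t-1}^2) * sigma_{t-1}^2, for both solutions
        s = alpha + (phi1 + theta1 * z2) * s
        s_tilde = alpha + (phi1 + theta1 * z2) * s_tilde
        path.append(s)
        path_tilde.append(s_tilde)
    return path, path_tilde


path, path_tilde = coupled_sigma2(alpha=0.1, phi1=0.0, theta1=2.0,
                                  s0=1.0, s0_tilde=100.0, n=2000)
final_gap = abs(path[-1] - path_tilde[-1])
print(final_gap)
```

The two paths start far apart and become numerically indistinguishable, even though the stationary solution has infinite variance in this example.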

The preceding theorem may seem at odds with a common interpretation of a stationarity
and stability condition as a condition for "persistence". The condition for $L_2$-stationarity
of a GARCH process is stronger than the condition for strict stationarity,
so that it appears as if we have found two different conditions for persistence. Whenever
a strictly stationary solution exists, the influence of initial values wears off as time goes
to infinity, and hence the initial values are not persistent. This is true independently of
the validity of the condition $\sum_j(\phi_j + \theta_j) < 1$ for $L_2$-stationarity. However, the latter
condition, if it holds, does ensure that the process approaches stationarity in the stronger
$L_2$-sense.

The condition $\sum_j(\phi_j + \theta_j) < 1$ is necessary for the strictly stationary solution to have
finite second moments. By an appropriate initialization we can ensure that a GARCH
process has finite second moments for every $t$, even if this condition fails. (It will then
not be stationary.) However, in this case the variances $\mathrm{E}X_t^2$ must diverge to infinity as
$t \to \infty$. This follows by a Fatou type argument, because the process will approach the
strictly stationary solution and this has infinite variance.

8.15 EXERCISE. Suppose that the time series $\tilde X_t$ is strictly stationary with infinite
second moments and $X_t - \tilde X_t \to 0$ in probability as $t \to \infty$. Show that $\mathrm{E}X_t^2 \to \infty$.

We can make this more concrete by considering the prediction formula for the conditional
variance process $\sigma_t^2$. For the GARCH(1, 1) process we prove below that

$$\mathrm{E}(X_{t+h-1}^2 \mid \mathcal{F}_{t-1}) = \mathrm{E}(\sigma_{t+h-1}^2 \mid \mathcal{F}_{t-1}) = (\phi_1 + \theta_1)^{h-1}\sigma_t^2 + \alpha\sum_{j=0}^{h-2}(\phi_1 + \theta_1)^j. \tag{8.8}$$

For $\phi_1 + \theta_1 < 1$ the first term on the far right converges to zero as $h \to \infty$, indicating
that information at the present time $t$ does not help to predict the conditional variance
process in the "infinite future". On the other hand, if $\phi_1 + \theta_1 \ge 1$ and $\alpha > 0$ then both
terms on the far right side contribute positively as $h \to \infty$. If $\phi_1 + \theta_1 = 1$, then the
relative contribution of the term $(\phi_1 + \theta_1)^{h-1}\sigma_t^2$ tends to zero as $h \to \infty$, whereas if
$\phi_1 + \theta_1 > 1$ the contributions are of the same order. In the last case the value of $\sigma_t^2$ appears
to be particularly "persistent".

The case that $\sum_j(\phi_j + \theta_j) = 1$ is often viewed as having particular interest and is
referred to as integrated GARCH or IGARCH. Many financial time series yield GARCH
fits that are close to IGARCH.

A GARCH process, being a martingale difference, does not allow nontrivial predictions
of its mean values. However, it is of interest to predict the conditional variances
$\sigma_t^2$, or equivalently the process of squares $X_t^2$. Predictions based on the infinite past $\mathcal{F}_t$
can be obtained using the auto-regressive representation from the proof of Theorem 8.10.
Let $A_t$ be the matrix given in (8.7) and let $Y_t = (\sigma_t^2, \ldots, \sigma_{t-p+1}^2, X_{t-1}^2, \ldots, X_{t-q+1}^2)^T$, so
that $Y_t = A_tY_{t-1} + b$ for every $t$. The vector $Y_t$ is $\mathcal{F}_{t-1}$-measurable (so that $Y_{t-1}$ is
$\mathcal{F}_{t-2}$-measurable), and the matrix $A_t$ depends on $Z_{t-1}$ only, with $A = \mathrm{E}(A_t \mid \mathcal{F}_{t-2})$
independent of $t$. It follows that

$$\mathrm{E}(Y_t \mid \mathcal{F}_{t-2}) = \mathrm{E}(A_t \mid \mathcal{F}_{t-2})Y_{t-1} + b = AY_{t-1} + b.$$

By iterating this equation we find that, for $h > 1$,

$$\mathrm{E}(Y_t \mid \mathcal{F}_{t-h}) = A^{h-1}Y_{t-h+1} + \sum_{j=0}^{h-2}A^jb.$$

In the case of a GARCH(1, 1) process the vector $Y_t$ is equal to $\sigma_t^2$ and the matrix $A$
reduces to the number $\phi_1 + \theta_1$, whence we obtain the equation (8.8). For a general
GARCH(p, q) process the process $\sigma_t^2$ is the first coordinate of $Y_t$, and the prediction
equation takes a more involved form, but is still explicitly given in the preceding display.
If $\sum_j(\phi_j + \theta_j) < 1$, then the spectral radius of the matrix $A$ is strictly smaller than 1,
so that the first term on the right converges to zero and the second term converges to
its limit $(I - A)^{-1}b$, both at an exponential rate, as $h \to \infty$. In this
case the potential of predicting the conditional variance process is limited to the very
near future.
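In the GARCH(1, 1) case the iteration $y \mapsto Ay + b$ with $A = \phi_1 + \theta_1$ and $b = \alpha$ and its closed form can be compared directly. The following sketch is only a numerical check of the prediction display; both function names are hypothetical, and the argument `sigma2_start` stands for the most recent conditional variance that is known at the time of prediction.

```python
def variance_forecast_closed_form(sigma2_start, alpha, phi1, theta1, h):
    """A^(h-1) * sigma2_start + sum_{j=0}^{h-2} A^j * alpha, with A = phi1 + theta1, h >= 1."""
    a = phi1 + theta1
    return a ** (h - 1) * sigma2_start + alpha * sum(a ** j for j in range(h - 1))


def variance_forecast_iterated(sigma2_start, alpha, phi1, theta1, h):
    """Same quantity by applying the one-step map y -> (phi1 + theta1) * y + alpha h - 1 times."""
    y = sigma2_start
    for _ in range(h - 1):
        y = (phi1 + theta1) * y + alpha
    return y


# Illustrative values with phi1 + theta1 = 0.9 < 1: forecasts approach alpha / (1 - 0.9) = 1.
near = variance_forecast_closed_form(5.0, 0.1, 0.8, 0.1, h=2)
far = variance_forecast_closed_form(5.0, 0.1, 0.8, 0.1, h=500)
print(near, far)
```

For $h = 1$ both functions return the starting value itself, and for large $h$ the forecast forgets the starting value, in line with the remark that prediction is informative only for the very near future.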



8.2 Linear GARCH with Leverage and Power GARCH 

Fluctuations of foreign exchange rates tend to be symmetric, in view of the two-sided 
nature of the foreign exchange market. However, it is an empirical finding that for asset 
prices the current returns and future volatility are negatively correlated. For instance, a 
crash in the stock market will be followed by large volatility. 

A linear GARCH model is not able to capture this type of asymmetric relationship,
because it models the volatility as a function of the squares of the past returns. One
attempt to allow for asymmetry is to replace the GARCH equation (8.1) by

$$\sigma_t^2 = \alpha + \phi_1\sigma_{t-1}^2 + \cdots + \phi_p\sigma_{t-p}^2 + \theta_1(|X_{t-1}| + \gamma_1X_{t-1})^2 + \cdots + \theta_q(|X_{t-q}| + \gamma_qX_{t-q})^2.$$

This reduces to the ordinary GARCH equation if the leverage coefficients $\gamma_j$ are set
equal to zero. If these coefficients are negative, then a positive deviation of the process
$X_t$ contributes to lower volatility in the near future, and conversely.

A power GARCH model is obtained by replacing the squares in the preceding display 
by other powers. 



Figure 8.3. The function $x \mapsto (|x| + \gamma x)^2$ for $\gamma = -0.2$.
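The asymmetry of the function in Figure 8.3 can be read off from two function values. The snippet below is a small illustration; the function name `leverage_term` is a hypothetical choice.

```python
def leverage_term(x, gamma):
    """Contribution (|x| + gamma * x)^2 of a past value x with leverage coefficient gamma."""
    return (abs(x) + gamma * x) ** 2


gamma = -0.2
impact_up = leverage_term(1.0, gamma)     # (1 - 0.2)^2 = 0.64
impact_down = leverage_term(-1.0, gamma)  # (1 + 0.2)^2 = 1.44
print(impact_up, impact_down)
```

With a negative leverage coefficient a negative deviation of the same size thus contributes more to the next conditional variance than a positive one, and for $\gamma = 0$ the term reduces to the square $x^2$ of the ordinary GARCH equation.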



8.3 Exponential GARCH 

The exponential GARCH or EGARCH model is significantly different from the GARCH
models described so far. It retains the basic set-up of a process of the form $X_t = \sigma_tZ_t$
for a martingale difference sequence $Z_t$ satisfying (8.2) and an $\mathcal{F}_{t-1}$-adapted process $\sigma_t$,
but replaces the GARCH equation by

$$\log\sigma_t^2 = \alpha + \phi_1\log\sigma_{t-1}^2 + \cdots + \phi_p\log\sigma_{t-p}^2 + \theta_1(|Z_{t-1}| + \gamma_1Z_{t-1}) + \cdots + \theta_q(|Z_{t-q}| + \gamma_qZ_{t-q}).$$

Through the presence of both the variables $Z_t$ and their absolute values and the
transformation to the logarithmic scale this can also capture the leverage effect. An advantage
of modelling the logarithm of the volatility is that the parameters of the model need not
be restricted to be positive.

Because the EGARCH model specifies the log volatility directly in terms of the noise
process $Z_t$ and its own past, its definition is less recursive than the ordinary GARCH
definition, and easier to handle. In particular, for fixed and identical leverage coefficients
$\gamma_i = \gamma$ the EGARCH equation describes the log volatility process $\log\sigma_t^2$ as a regular
ARMA process driven by the noise process $|Z_t| + \gamma Z_t$, and we may use the theory for
ARMA processes to study its properties. In particular, if the roots of the polynomial
$\phi(z) = 1 - \phi_1z - \cdots - \phi_pz^p$ are outside the unit circle, then there exists a stationary
solution $\log\sigma_t^2$ that is measurable relative to the $\sigma$-field generated by the process $Z_{t-1}$.
If the process $Z_t$ is strictly stationary, then so is the stationary solution $\log\sigma_t^2$ and so is
the EGARCH process $X_t = \sigma_tZ_t$.
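A minimal simulation sketch of the EGARCH equation with $p = q = 1$ may make the ARMA structure of the log volatility visible. It assumes Gaussian noise and an arbitrary starting value followed by a burn-in period; the function name and the parameter values are hypothetical choices for illustration.

```python
import math
import random


def simulate_egarch11(alpha, phi1, theta1, gamma1, n, burn_in=1000, seed=0):
    """Simulate X_t = sigma_t * Z_t with
    log sigma_t^2 = alpha + phi1 * log sigma_{t-1}^2 + theta1 * (|Z_{t-1}| + gamma1 * Z_{t-1})."""
    rng = random.Random(seed)
    log_s2 = 0.0                      # arbitrary starting value, forgotten after burn-in
    xs, log_s2_path = [], []
    for t in range(burn_in + n):
        z = rng.gauss(0.0, 1.0)
        if t >= burn_in:
            xs.append(math.exp(0.5 * log_s2) * z)
            log_s2_path.append(log_s2)
        log_s2 = alpha + phi1 * log_s2 + theta1 * (abs(z) + gamma1 * z)
    return xs, log_s2_path


xs, log_s2_path = simulate_egarch11(alpha=-0.1, phi1=0.9, theta1=0.2, gamma1=-0.5, n=100_000)
mean_log_s2 = sum(log_s2_path) / len(log_s2_path)
# Stationary mean of the AR(1) recursion, using E|Z| = sqrt(2/pi) for Gaussian Z.
expected_mean = (-0.1 + 0.2 * math.sqrt(2.0 / math.pi)) / (1.0 - 0.9)
print(mean_log_s2, expected_mean)
```

Note that the intercept is negative here, which is allowed because the recursion is written on the logarithmic scale.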



8.4 GARCH in Mean 

A GARCH process by its definition is a white noise process, and thus it could be a
useful candidate to drive another process. For instance, an observed process $Y_t$ could be
assumed to satisfy the ARMA equation

$$\phi(B)Y_t = \theta(B)X_t,$$

for $X_t$ a GARCH process, relative to other polynomials $\tilde\phi$ and $\tilde\theta$ (which are unrelated to
$\phi$ and $\theta$). One then says that $Y_t$ is "ARMA in the mean" and "GARCH in the variance",
or that $Y_t$ is an ARMA-GARCH series. Results on ARMA processes that hold for any
driving white noise process will clearly also hold in the present case, where the white
noise process is a GARCH process.

8.16 EXERCISE. Let $X_t$ be a stationary GARCH process relative to polynomials $\tilde\phi$ and
$\tilde\theta$ and let the time series $Y_t$ be the unique stationary solution to the equation $\phi(B)Y_t =
\theta(B)X_t$, for $\phi$ and $\theta$ polynomials that have all their roots outside the unit disc. Let $\mathcal{F}_t$ be
the filtration generated by $Y_t$. Show that $\operatorname{var}(Y_t \mid \mathcal{F}_{t-1}) = \operatorname{var}(X_t \mid X_{t-1}, X_{t-2}, \ldots)$ almost
surely.

It has been found useful to go a step further and let also the conditional variance of
the driving GARCH process appear in the mean model for the process $Y_t$. Thus given a
GARCH process $X_t$ with conditional variance process $\sigma_t^2 = \operatorname{var}(X_t \mid \mathcal{F}_{t-1})$ it is assumed
that $Y_t = f(\sigma_t, X_t)$ for a fixed function $f$. The function $f$ is assumed known up to a
number of parameters. For instance,

$$\phi(B)Y_t = \psi\sigma_t + \theta(B)X_t,$$
$$\phi(B)Y_t = \psi\sigma_t^2 + \theta(B)X_t,$$
$$\phi(B)Y_t = \psi\log\sigma_t^2 + \theta(B)X_t.$$

These models are known as GARCH-in-mean, or GARCH-M models.



9 

State Space Models †



A causal AR(1) process with i.i.d. innovations $Z_t$ is a Markov process: the conditional
distribution of the "future value" $X_{t+1} = \phi X_t + Z_{t+1}$ given the "past values" $X_1, \ldots, X_t$
depends on the "present value" $X_t$ only. Specifically, the conditional density of $X_{t+1}$ is
given by

$$p_{X_{t+1} \mid X_1, \ldots, X_t}(x) = p_Z(x - \phi X_t).$$

(We use causality to ensure that $Z_{t+1}$ is independent of $X_1, \ldots, X_t$.) The Markov structure
has an obvious practical interpretation and suggests a recursive algorithm to compute
predictions. It also allows a simple factorization of the likelihood function. For
instance, the likelihood for the causal AR(1) process in the previous paragraph can be
written

$$p_{X_1, \ldots, X_n}(X_1, \ldots, X_n) = \prod_{t=2}^{n} p_Z(X_t - \phi X_{t-1})\, p_{X_1}(X_1).$$

It would be of interest to have a similar property for more general time series.
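The factorization translates directly into a loop over the observations. The sketch below assumes Gaussian innovations with variance `s2` and a stationary start, so that $X_1$ is normal with variance $s2/(1 - \phi^2)$; these distributional choices and the function names are assumptions made for illustration, since the text leaves the density $p_Z$ general.

```python
import math


def normal_logpdf(x, mean, var):
    """Log density of the normal distribution with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def ar1_loglik(xs, phi, s2):
    """log p(X_1) + sum_{t=2}^n log p_Z(X_t - phi * X_{t-1}) for a causal Gaussian AR(1)."""
    # Stationary marginal of X_1 is N(0, s2 / (1 - phi^2)) when |phi| < 1.
    ll = normal_logpdf(xs[0], 0.0, s2 / (1.0 - phi * phi))
    for prev, cur in zip(xs, xs[1:]):
        ll += normal_logpdf(cur - phi * prev, 0.0, s2)   # innovation density p_Z
    return ll


phi, s2 = 0.5, 2.0
x1, x2 = 0.3, -1.1
factorized = ar1_loglik([x1, x2], phi, s2)

# Direct bivariate normal log density of (X_1, X_2), for comparison.
v = s2 / (1.0 - phi ** 2)                 # common variance
det = v * v * (1.0 - phi ** 2)            # determinant of the covariance matrix
quad = (x1 * x1 - 2.0 * phi * x1 * x2 + x2 * x2) / s2
direct = -math.log(2.0 * math.pi) - 0.5 * math.log(det) - 0.5 * quad
print(factorized, direct)
```

The two numbers agree, which is the content of the factorization for $n = 2$.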

Some non-Markovian time series can be forced into Markov form by incorporating
enough past information into a "present state". For instance, an AR(p) process with
$p \ge 2$ is not Markov, because $X_{t+1}$ depends on $p$ variables in the past. We can remedy
this by defining a "present state" to consist of the vector $\vec X_t := (X_t, \ldots, X_{t-p+1})$: the
process $\vec X_t$ is Markov. In general, to induce Markov structure we must define a state
in such a way that it incorporates all relevant information for transition to the next
state. This is of interest only if this is possible using "states" that are of not too high
complexity.

A hidden Markov model consists of a Markov chain, but rather than the state at time
$t$ we observe a transformation of it, up to noise which is independent of the Markov chain.
A related structure is the state space model. Given an "initial state" $X_0$, "disturbances"
$V_1, W_1, V_2, \ldots$ and functions $f_t$ and $g_t$, processes $X_t$ and $Y_t$ are defined recursively by

$$\begin{aligned} X_t &= f_t(X_{t-1}, V_t), \\ Y_t &= g_t(X_t, W_t). \end{aligned} \tag{9.1}$$

We refer to $X_t$ as the "state" at time $t$ and to $Y_t$ as the "output". The state process $X_t$
can be viewed as primary and evolving in the background, describing the consecutive
states of a system in time. At each time $t$ the system is measured, producing an output
$Y_t$. If the sequence $X_0, V_1, V_2, \ldots$ consists of independent variables, then the state process
$X_t$ is a Markov chain. If the variables $X_0, V_1, W_1, V_2, W_2, V_3, \ldots$ are independent, then
for every $t$ given the state $X_t$ the output $Y_t$ is conditionally independent of the states
$X_0, X_1, \ldots$ and outputs $Y_1, \ldots, Y_{t-1}, Y_{t+1}, \ldots$. Under this condition the state space model
becomes a hidden Markov model.

† In view of the somewhat technical content, this chapter may contain more errors than others. Please be
(more) critical.

Typically, the state process $X_t$ is not observed, but instead at time $t$ we only observe
the output $Y_t$. For this reason the process $Y_t$ is also referred to as the "measurement
process". The second equation in the display is called the "measurement equation", while
the first is the "state equation". Inference might be directed at estimating parameters
attached to the functions $f_t$ or $g_t$, to the distribution of the errors or to the initial
state, and/or at predicting or reconstructing the states $X_t$ from the observed outputs
$Y_1, \ldots, Y_n$. Predicting or reconstructing the state sequence is referred to as "filtering" or
"smoothing".

For linear functions $f_t$ and $g_t$ and vector-valued states and outputs the state space
model takes the form

$$\begin{aligned} X_t &= F_tX_{t-1} + V_t, \\ Y_t &= G_tX_t + W_t. \end{aligned} \tag{9.2}$$

The matrices $F_t$ and $G_t$ are often taken independent of $t$. The analysis usually concerns
linear predictions, and then a common assumption is that the vectors $X_0, V_1, W_1, V_2, \ldots$
are uncorrelated. If $F_t$ is independent of $t$ and the vectors $V_t$ form a white noise process,
then the series $X_t$ is a VAR(1) process.

Because state space models are easy to handle, it is of interest to represent a given 
observable time series Y t as the output of a state space model. This entails finding a state 
space and a state process X t , and a corresponding state space model with the given series 
Y t as output. It is particularly attractive to find a linear state space model. Such a state 
space representation is definitely not unique. An important issue in systems theory is to 
find a (linear) state space representation of minimum dimension. 

9.1 Example (State Space representation ARMA). Let $Y_t$ be a stationary, causal
ARMA(r + 1, r) process satisfying $\phi(B)Y_t = \theta(B)Z_t$ for an i.i.d. process $Z_t$. (The choice
$p = q + 1$ can always be achieved by padding the set of coefficients of the polynomial
$\phi$ or $\theta$ with zeros.) Then the AR(p) process $X_t = (1/\phi)(B)Z_t$ is related to $Y_t$ through
$Y_t = \theta(B)X_t$. Thus

$$Y_t = (\theta_0, \ldots, \theta_r)\begin{pmatrix} X_t \\ \vdots \\ X_{t-r} \end{pmatrix},$$

$$\begin{pmatrix} X_t \\ X_{t-1} \\ \vdots \\ X_{t-r} \end{pmatrix} = \begin{pmatrix} \phi_1 & \phi_2 & \cdots & \phi_r & \phi_{r+1} \\ 1 & 0 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{pmatrix}\begin{pmatrix} X_{t-1} \\ X_{t-2} \\ \vdots \\ X_{t-r-1} \end{pmatrix} + \begin{pmatrix} Z_t \\ 0 \\ \vdots \\ 0 \end{pmatrix}.$$

This is a linear state space representation with state vector $(X_t, \ldots, X_{t-r})$, and matrices
$F_t$ and $G_t$, which are independent of $t$. Under causality the innovations $V_t = (Z_t, 0, \ldots, 0)$
are orthogonal to the past of the processes $X_t$ and $Y_t$; the innovations $W_t$ are defined to be zero. The
state vectors are typically unobserved, except when $\theta$ is of degree zero. (If the ARMA
process is invertible and the coefficients of $\theta$ are known, then the states can be reconstructed
from the infinite past.)

In the present representation the state-dimension of the ARMA(p, q) process is $r +
1 = \max(p, q + 1)$. By using a more complicated noise process it is possible to represent
an ARMA(p, q) process in dimension $\max(p, q)$, but this difference appears not to be
very important.* □
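The representation can be checked numerically by running the state recursion and verifying that the outputs satisfy the ARMA equation. The sketch below is an illustration under stated assumptions: the state is started at zero (so the simulated series is not stationary), the noise is Gaussian, and the function name and the ARMA(2, 1) coefficients are hypothetical.

```python
import random


def arma_via_state_space(phi, theta, zs):
    """Outputs Y_t = theta(B) X_t, with the AR process phi(B) X_t = Z_t run in companion form.

    phi = [phi_1, ..., phi_{r+1}] and theta = [theta_0, ..., theta_r]; the state is
    (X_t, ..., X_{t-r}), started at zero (an assumption made for this sketch).
    """
    r = len(theta) - 1
    assert len(phi) == r + 1
    state = [0.0] * (r + 1)
    ys = []
    for z in zs:
        x_new = sum(p * x for p, x in zip(phi, state)) + z   # first row of F, plus noise V_t
        state = [x_new] + state[:-1]                          # remaining rows shift the state
        ys.append(sum(g * x for g, x in zip(theta, state)))   # Y_t = G X_t, with W_t = 0
    return ys


rng = random.Random(0)
zs = [rng.gauss(0.0, 1.0) for _ in range(200)]
# ARMA(2, 1): Y_t - 0.5 Y_{t-1} + 0.3 Y_{t-2} = Z_t + 0.4 Z_{t-1}
ys = arma_via_state_space(phi=[0.5, -0.3], theta=[1.0, 0.4], zs=zs)
residuals = [ys[t] - 0.5 * ys[t - 1] + 0.3 * ys[t - 2] - zs[t] - 0.4 * zs[t - 1]
             for t in range(2, len(ys))]
print(max(abs(e) for e in residuals))
```

The residuals of the ARMA equation vanish up to rounding error, so the outputs of the state space model indeed follow the ARMA recursion.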

9.2 Example (State Space representation ARIMA). Consider a time series $Z_t$ whose
differences $Y_t = \nabla Z_t$ satisfy the linear state space model (9.2) for state sequence $X_t$.
Writing $Z_t = Y_t + Z_{t-1} = G_tX_t + W_t + Z_{t-1}$, we obtain that

$$\begin{pmatrix} X_t \\ Z_{t-1} \end{pmatrix} = \begin{pmatrix} F_t & 0 \\ G_{t-1} & 1 \end{pmatrix}\begin{pmatrix} X_{t-1} \\ Z_{t-2} \end{pmatrix} + \begin{pmatrix} V_t \\ W_{t-1} \end{pmatrix},$$

$$Z_t = (G_t \;\; 1)\begin{pmatrix} X_t \\ Z_{t-1} \end{pmatrix} + W_t.$$

We conclude that the time series $Z_t$ possesses a linear state space representation, with
states of one dimension higher than the states of the original series.

A drawback of the preceding representation is that the error vectors $(V_t, W_{t-1}, W_t)$
are not necessarily uncorrelated if the error vectors $(V_t, W_t)$ in the system with outputs
$Y_t$ have this property. In the case that $Z_t$ is an ARIMA(p, 1, q) process, we may use the
state representation of the preceding example for the ARMA(p, q) process $Y_t$, which has
errors $W_t = 0$, and this disadvantage does not arise. Alternatively, we can avoid this
problem by using another state space representation. For instance, we can write

$$\begin{pmatrix} X_t \\ Z_t \end{pmatrix} = \begin{pmatrix} F_t & 0 \\ G_tF_t & 1 \end{pmatrix}\begin{pmatrix} X_{t-1} \\ Z_{t-1} \end{pmatrix} + \begin{pmatrix} V_t \\ G_tV_t + W_t \end{pmatrix},$$

$$Z_t = (0 \;\; 1)\begin{pmatrix} X_t \\ Z_t \end{pmatrix}.$$

This illustrates that there may be multiple possibilities to represent a time series as the
output of a state space model.

The preceding can be extended to general ARIMA(p, d, q) models. If $Y_t = (1 - B)^dZ_t$,
then $Z_t = Y_t - \sum_{j=1}^{d}\binom{d}{j}(-1)^jZ_{t-j}$. If the process $Y_t$ can be represented as the output
of a state space model with state vectors $X_t$, then $Z_t$ can be represented as the output
of a state space model with the extended states $(X_t, Z_{t-1}, \ldots, Z_{t-d})$, or, alternatively,
$(X_t, Z_t, \ldots, Z_{t-d+1})$. □

* See e.g. Brockwell and Davis, p. 469-471.

9.3 Example (Stochastic linear trend). A time series with a linear trend could be
modelled as $Y_t = \alpha + \beta t + W_t$ for constants $\alpha$ and $\beta$, and a stationary process $W_t$ (for
instance an ARMA process). This restricts the nonstationary part of the time series to
a deterministic component, which may be unrealistic. An alternative is the stochastic
linear trend model described by

$$\begin{pmatrix} A_t \\ B_t \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} A_{t-1} \\ B_{t-1} \end{pmatrix} + V_t,$$

$$Y_t = A_t + W_t.$$

The stochastic processes $(A_t, B_t)$ and noise processes $(V_t, W_t)$ are unobserved. This state
space model contains the deterministic linear trend model as the degenerate case where
$V_t = 0$, so that $B_t = B_0$ and $A_t = A_0 + B_0t$.

The state equations imply that $\nabla A_t = B_{t-1} + V_{t,1}$ and $\nabla B_t = V_{t,2}$, for $V_t =
(V_{t,1}, V_{t,2})^T$. Taking differences on the output equation $Y_t = A_t + W_t$ twice, we find that

$$\nabla^2Y_t = \nabla B_{t-1} + \nabla V_{t,1} + \nabla^2W_t = V_{t-1,2} + \nabla V_{t,1} + \nabla^2W_t.$$

If the process $(V_t, W_t)$ is a white noise process, then the auto-correlation function of
the process on the right vanishes for lags bigger than 2 (the polynomial $\nabla^2 = (1 - B)^2$
being of degree 2). Thus the right side is an MA(2) process, whence the process $Y_t$ is an
ARIMA(0,2,2) process. □
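The identity for the second differences can be verified path by path in a simulation. The sketch below assumes independent standard normal noise variables and starting values $A_0 = B_0 = 0$; these choices and the function name are hypothetical and serve only as an illustration.

```python
import random


def simulate_stochastic_trend(n, a0=0.0, b0=0.0, seed=0):
    """Simulate the stochastic linear trend model with independent standard normal noise."""
    rng = random.Random(seed)
    a, b = a0, b0
    ys, v1s, v2s, ws = [], [], [], []
    for _ in range(n):
        v1, v2, w = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        # A_t = A_{t-1} + B_{t-1} + V_{t,1} and B_t = B_{t-1} + V_{t,2} (old b is used for A_t)
        a, b = a + b + v1, b + v2
        ys.append(a + w)               # Y_t = A_t + W_t
        v1s.append(v1)
        v2s.append(v2)
        ws.append(w)
    return ys, v1s, v2s, ws


ys, v1s, v2s, ws = simulate_stochastic_trend(300)
# Second differences of Y_t minus V_{t-1,2} + (V_{t,1} - V_{t-1,1}) + second differences of W_t.
gaps = [
    (ys[t] - 2 * ys[t - 1] + ys[t - 2])
    - (v2s[t - 1] + (v1s[t] - v1s[t - 1]) + (ws[t] - 2 * ws[t - 1] + ws[t - 2]))
    for t in range(2, len(ys))
]
print(max(abs(g) for g in gaps))
```

The differences vanish up to rounding error, confirming that the twice differenced output is a moving average of the noise variables over three consecutive time points.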

9.4 Example (Structural models). Besides a trend we may suspect that a given time
series shows a seasonal effect. One possible parametrization of a deterministic seasonal
effect with $S$ seasons is the function

$$t \mapsto \sum_{s=1}^{\lfloor S/2\rfloor} \gamma_s\cos(\lambda_st) + \delta_s\sin(\lambda_st), \qquad \lambda_s = \frac{2\pi s}{S}, \quad s = 1, \ldots, \lfloor S/2\rfloor.$$

By appropriate choice of the parameters $\gamma_s$ and $\delta_s$ this function is able to adapt to
any periodic function on the integers with period $S$. We could add this deterministic
function to a given time series model in order to account for seasonality. Again it may
not be realistic to require the seasonality a-priori to be deterministic. An alternative is
to replace the fixed coefficients $(\gamma_s, \delta_s)$ by the time series defined by

$$\begin{pmatrix} \gamma_{s,t} \\ \delta_{s,t} \end{pmatrix} = \begin{pmatrix} \cos\lambda_s & \sin\lambda_s \\ \sin\lambda_s & -\cos\lambda_s \end{pmatrix}\begin{pmatrix} \gamma_{s,t-1} \\ \delta_{s,t-1} \end{pmatrix} + V_{s,t}.$$

An observed time series may next have the form

$$Y_t = (1 \;\; 1 \;\; \cdots \;\; 1)\begin{pmatrix} \gamma_{1,t} \\ \gamma_{2,t} \\ \vdots \\ \gamma_{\lfloor S/2\rfloor,t} \end{pmatrix} + Z_t.$$

Together these equations again constitute a linear state space model.

A model with both a stochastic linear trend and a stochastic seasonal component is
known as a "structural model". □

9.5 EXERCISE. Consider the state space model with state equations $\gamma_t = \gamma_{t-1}\cos\lambda +
\delta_{t-1}\sin\lambda + V_{t,1}$ and $\delta_t = \gamma_{t-1}\sin\lambda - \delta_{t-1}\cos\lambda + V_{t,2}$ and output equation $Y_t = \gamma_t + W_t$.
What does this model reduce to if $V_t = 0$?

The show piece of state space modelling is the Kalman filter. This is an algorithm 
to compute linear predictions (for linear state space models) , under the assumption that 
the parameters of the system are known. Because the formulas for the predictors, which 
are functions of the parameters and the outputs, can in turn be used to set up estimating 
equations for the parameters, the Kalman filter is also important for statistical analysis. 
We start discussing parameter estimation in Chapter 10. 

The variables $X_t$ and $Y_t$ in a state space model will typically be random vectors.
For two random vectors $X$ and $Y$ the covariance or "cross-covariance" is the matrix
$\operatorname{Cov}(X, Y) = \mathrm{E}(X - \mathrm{E}X)(Y - \mathrm{E}Y)^T$. The random vectors $X$ and $Y$ are called "uncorrelated"
if $\operatorname{Cov}(X, Y) = 0$. The linear span of a set of vectors is defined as the linear span
of all their coordinates. Thus this is a space of (univariate) random variables, rather than
random vectors! We shall also understand a projection operator $\Pi$, which is a map on
the space of random variables, to act coordinatewise on vectors: if $X$ is a vector, then
$\Pi X$ is the vector consisting of the projections of the coordinates of $X$. As a vector-valued
operator a projection $\Pi$ is still linear, in that $\Pi(FX + Y) = F\Pi X + \Pi Y$, for any matrix
$F$ and random vectors $X$ and $Y$.



9.1 Kalman Filtering 

The Kalman filter is a recursive algorithm to compute best linear predictions of the
states $X_1, X_2, \ldots$ given observations $Y_1, Y_2, \ldots$ in the linear state space model (9.2).
In the simplest situation the vectors $X_0, V_1, W_1, V_2, W_2, \ldots$ are assumed uncorrelated.
We shall first derive the filter under the more general assumption that the vectors
$X_0, (V_1, W_1), (V_2, W_2), \ldots$ are uncorrelated, and in Section 9.1.3 we further relax this
condition. The matrices $F_t$ and $G_t$ as well as the covariance matrices of the noise variables
$(V_t, W_t)$ are assumed known.

By applying (9.2) recursively, we see that the vector $X_t$ is contained in the linear
span of the variables $X_0, V_1, \ldots, V_t$. It is immediate from (9.2) that the vector $Y_t$ is
contained in the linear span of $X_t$ and $W_t$. These facts are true for every $t \in \mathbb{N}$. It follows
that under our conditions the noise variables $V_t$ and $W_t$ are uncorrelated with all vectors
$X_s$ and $Y_s$ with $s < t$.

Let $H_0$ be a given closed linear subspace of $L_2(\Omega, \mathcal{U}, P)$ that contains the constants,
and for $t \ge 0$ let $\Pi_t$ be the orthogonal projection onto the space $H_t = H_0 + \operatorname{lin}(Y_1, \ldots, Y_t)$.
The space $H_0$ may be viewed as our "knowledge" at time 0; it may be $H_0 = \operatorname{lin}\{1\}$. We
assume that the noise vectors $V_1, W_1, V_2, \ldots$ are orthogonal to $H_0$. Combined with the
preceding this shows that the vector $(V_t, W_t)$ is orthogonal to the space $H_{t-1}$, for every
$t \ge 1$.

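The Kalman filter consists of the recursions

$$\cdots \to \begin{pmatrix} \Pi_{t-1}X_{t-1} \\ \operatorname{Cov}(\Pi_{t-1}X_{t-1}) \\ \operatorname{Cov}(X_{t-1}) \end{pmatrix} \xrightarrow{(1)} \begin{pmatrix} \Pi_{t-1}X_t \\ \operatorname{Cov}(\Pi_{t-1}X_t) \\ \operatorname{Cov}(X_t) \end{pmatrix} \xrightarrow{(2)} \begin{pmatrix} \Pi_tX_t \\ \operatorname{Cov}(\Pi_tX_t) \\ \operatorname{Cov}(X_t) \end{pmatrix} \to \cdots$$

Thus the Kalman filter alternates between "updating the current state", step (1), and
"updating the prediction space", step (2).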

Step (1) is simple. Because $V_t \perp H_0, Y_1, \ldots, Y_{t-1}$ by assumption, we have $\Pi_{t-1}V_t =
0$. Applying $\Pi_{t-1}$ to the state equation $X_t = F_tX_{t-1} + V_t$ we find that, by the linearity of
a projection,

$$\begin{aligned} \Pi_{t-1}X_t &= F_t(\Pi_{t-1}X_{t-1}), \\ \operatorname{Cov}(\Pi_{t-1}X_t) &= F_t\operatorname{Cov}(\Pi_{t-1}X_{t-1})F_t^T, \\ \operatorname{Cov}(X_t) &= F_t\operatorname{Cov}(X_{t-1})F_t^T + \operatorname{Cov}(V_t). \end{aligned}$$

This gives a complete description of step (1) of the algorithm.

Step (2) is more involved, but also comes down to simple matrix computations. The
vector $\tilde W_t = Y_t - \Pi_{t-1} Y_t$ is known as the innovation at time $t$, because it is the part of $Y_t$
that is not explainable at time $t-1$. It is orthogonal to $H_{t-1}$, and together with this space
spans $H_t$. It follows that $H_t$ can be orthogonally decomposed as $H_t = H_{t-1} + \mathrm{lin}\,\tilde W_t$
and hence the projection onto $H_t$ is the sum of the projections onto the spaces $H_{t-1}$ and
$\mathrm{lin}\,\tilde W_t$. At the beginning of step (2) the vector $\tilde W_t$ is known, because we can write, using
the measurement equation and the fact that $\Pi_{t-1} W_t = 0$,

$$\tilde W_t = Y_t - \Pi_{t-1} Y_t = Y_t - G_t \Pi_{t-1} X_t = G_t (X_t - \Pi_{t-1} X_t) + W_t. \tag{9.3}$$
Applying this to projecting the variable $X_t$, we find

$$\Pi_t X_t = \Pi_{t-1} X_t + \Lambda_t \tilde W_t, \qquad \Lambda_t = \mathrm{Cov}(X_t, \tilde W_t)\, \mathrm{Cov}(\tilde W_t)^{-1}. \tag{9.4}$$

The matrix $\Lambda_t$ is chosen such that $\Lambda_t \tilde W_t$ is the projection of $X_t$ onto $\mathrm{lin}\,\tilde W_t$. Because
$W_t \perp X_{t-1}$ the state equation yields $\mathrm{Cov}(X_t, W_t) = \mathrm{Cov}(V_t, W_t)$. By the orthogonality
property of projections $\mathrm{Cov}(X_t, X_t - \Pi_{t-1} X_t) = \mathrm{Cov}(X_t - \Pi_{t-1} X_t)$. Combining
this and the identity $\tilde W_t = G_t (X_t - \Pi_{t-1} X_t) + W_t$ from (9.3), we compute

$$\begin{aligned}
\mathrm{Cov}(X_t, \tilde W_t) &= \mathrm{Cov}(X_t - \Pi_{t-1} X_t)\, G_t^T + \mathrm{Cov}(V_t, W_t), \\
\mathrm{Cov}(\tilde W_t) &= G_t\, \mathrm{Cov}(X_t - \Pi_{t-1} X_t)\, G_t^T + G_t\, \mathrm{Cov}(V_t, W_t) \\
&\qquad + \mathrm{Cov}(W_t, V_t)\, G_t^T + \mathrm{Cov}(W_t), \\
\mathrm{Cov}(X_t - \Pi_{t-1} X_t) &= \mathrm{Cov}(X_t) - \mathrm{Cov}(\Pi_{t-1} X_t).
\end{aligned} \tag{9.5}$$

The matrix $\mathrm{Cov}(X_t - \Pi_{t-1} X_t)$ is the prediction error matrix at time $t-1$ and the last
equation follows by Pythagoras' rule. To complete the recursion of step (2) we compute
from (9.4)

$$\mathrm{Cov}(\Pi_t X_t) = \mathrm{Cov}(\Pi_{t-1} X_t) + \Lambda_t\, \mathrm{Cov}(\tilde W_t)\, \Lambda_t^T. \tag{9.6}$$

Equations (9.4)-(9.6) give a complete description of step (2) of the Kalman recursion. 
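
The two steps translate directly into matrix code. The following is a minimal sketch, not a reference implementation: the function names, the argument names `Q`, `R`, `S` (for the covariance matrices of $V_t$, of $W_t$, and between $V_t$ and $W_t$) and the use of NumPy are assumptions made for illustration. The innovation is computed from the observed value `y` as $y - G_t \Pi_{t-1} X_t$, in agreement with (9.3).

```python
import numpy as np

def kalman_step1(F, Q, proj, cov_proj, cov_x):
    """Step (1): map (Pi_{t-1}X_{t-1}, Cov(Pi_{t-1}X_{t-1}), Cov(X_{t-1}))
    to (Pi_{t-1}X_t, Cov(Pi_{t-1}X_t), Cov(X_t)); Q = Cov(V_t)."""
    return F @ proj, F @ cov_proj @ F.T, F @ cov_x @ F.T + Q

def kalman_step2(G, R, S, y, pred, cov_pred, cov_x):
    """Step (2), equations (9.3)-(9.6); R = Cov(W_t), S = Cov(V_t, W_t)."""
    err = cov_x - cov_pred                          # prediction error matrix, last line of (9.5)
    cov_xw = err @ G.T + S                          # Cov(X_t, innovation), first line of (9.5)
    cov_w = G @ err @ G.T + G @ S + S.T @ G.T + R   # Cov(innovation), second line of (9.5)
    innov = y - G @ pred                            # observed innovation, (9.3)
    gain = cov_xw @ np.linalg.inv(cov_w)            # the matrix Lambda_t of (9.4)
    filt = pred + gain @ innov                      # Pi_t X_t, (9.4)
    cov_filt = cov_pred + gain @ cov_w @ gain.T     # Cov(Pi_t X_t), (9.6)
    return filt, cov_filt, innov, cov_w, gain
```

With $H_0 = \mathrm{lin}\{1\}$ one possible start is `proj = E X_0`, `cov_proj = 0` and `cov_x = Cov(X_0)`, after which the two functions are called alternately for $t = 1, 2, \ldots$.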

Predictions of future values of the state variable follow easily from $\Pi_t X_t$, because
$\Pi_t X_{t+h} = F_{t+h} \Pi_t X_{t+h-1}$ for any $h \ge 1$. Given the predicted states, future outputs can
be predicted from the measurement equation by $\Pi_t Y_{t+h} = G_{t+h} \Pi_t X_{t+h}$.
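
As a small illustration of these two formulas, the sketch below iterates them for $h = 1, 2, \ldots$. The function name and the convention that the matrices are passed as Python lists are assumptions made for this example only.

```python
import numpy as np

def predict_ahead(filt, F_list, G_list):
    """Return the lists (Pi_t X_{t+h}) and (Pi_t Y_{t+h}) for h = 1, ..., len(F_list).
    filt is Pi_t X_t; F_list[h-1] is F_{t+h} and G_list[h-1] is G_{t+h}."""
    states, outputs = [], []
    x = filt
    for F, G in zip(F_list, G_list):
        x = F @ x                # Pi_t X_{t+h} = F_{t+h} Pi_t X_{t+h-1}
        states.append(x)
        outputs.append(G @ x)    # Pi_t Y_{t+h} = G_{t+h} Pi_t X_{t+h}
    return states, outputs
```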

The Kalman algorithm must be initialized in one of its two steps, for instance by
providing $\Pi_0 X_1$ and its covariance matrix, so that the recursion can start with a step
of type (2). It is here where the choice of $H_0$ plays a role. Choosing $H_0 = \mathrm{lin}(1)$ gives
predictions using $Y_1, \ldots, Y_t$ as well as an intercept and requires that we know $\Pi_0 X_1 = EX_1$.
It may also be desired that $\Pi_{t-1} X_t$ is the projection onto $\mathrm{lin}(1, Y_{t-1}, Y_{t-2}, \ldots)$ for
a stationary extension of $Y_t$ into the past. Then we set $\Pi_0 X_1$ equal to the projection of
$X_1$ onto $H_0 = \mathrm{lin}(1, Y_0, Y_{-1}, \ldots)$.

9.1.1 Missing Observations 

A considerable attraction of the Kalman filter algorithm is the ease by which missing
observations can be accommodated. This can be achieved by simply filling in the missing
data points by "external" variables that are independent of the system. Suppose that
$(X_t, Y_t)$ follows the linear state space model (9.2) and that we observe a subset $(Y_t)_{t \in T}$
of the variables $Y_1, \ldots, Y_n$. We define a new set of matrices $G_t^*$ and noise variables $W_t^*$
by

$$\begin{aligned}
G_t^* &= G_t, & W_t^* &= W_t, & t &\in T, \\
G_t^* &= 0, & W_t^* &= \bar W_t, & t &\notin T,
\end{aligned}$$

for random vectors $\bar W_t$ that are independent of the vectors that are already in the system.
The choice $\bar W_t = 0$ is permitted. Next we set

$$\begin{aligned}
X_t &= F_t X_{t-1} + V_t, \\
Y_t^* &= G_t^* X_t + W_t^*.
\end{aligned}$$

The variables $(X_t, Y_t^*)$ follow a state space model with the same state vectors $X_t$. For
$t \in T$ the outputs $Y_t^* = Y_t$ are identical to the outputs in the original system, while for
$t \notin T$ the output is $Y_t^* = \bar W_t$, which is pure noise by assumption. Because the noise
variables $\bar W_t$ do not add to the prediction of the hidden state $X_t$, best predictions of
states based on the observed outputs $(Y_t)_{t \in T}$ or based on $Y_1^*, \ldots, Y_n^*$ are identical. We
can compute the best predictions based on $Y_1^*, \ldots, Y_n^*$ by the Kalman recursions, but
with the matrices $G_t^*$ and $\mathrm{Cov}(W_t^*)$ substituted for $G_t$ and $\mathrm{Cov}(W_t)$. Because the $Y_t^*$
with $t \notin T$ do not appear in the projection formula, we can just as well set their "observed
values" equal to zero in the computations.
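
The device can be sketched in a few lines. The code below is an illustration under simplifying assumptions that are not part of the text: the model is time-invariant, $\mathrm{Cov}(V_t, W_t) = 0$, a missing observation is encoded as a row containing NaN, and the external noise $\bar W_t$ is given the identity covariance matrix. All names are hypothetical.

```python
import numpy as np

def filter_with_missing(ys, F, G, Q, R, m0, P0):
    """Kalman recursions with missing outputs; returns the sequence Pi_t X_t.
    m0 = E X_0 and P0 = Cov(X_0); a row of ys containing NaN counts as missing,
    in which case G* = 0, Cov(W*) = I and the 'observed value' is set to zero."""
    proj, cov_proj, cov_x = m0, np.zeros_like(P0), P0
    out = []
    for y in ys:
        # step (1)
        pred, cov_pred, cov_x = F @ proj, F @ cov_proj @ F.T, F @ cov_x @ F.T + Q
        missing = np.any(np.isnan(y))
        G_star = np.zeros_like(G) if missing else G
        R_star = np.eye(R.shape[0]) if missing else R
        y_star = np.zeros_like(y) if missing else y
        # step (2) with G* and Cov(W*) substituted for G and Cov(W)
        err = cov_x - cov_pred
        cov_w = G_star @ err @ G_star.T + R_star
        gain = err @ G_star.T @ np.linalg.inv(cov_w)
        proj = pred + gain @ (y_star - G_star @ pred)
        cov_proj = cov_pred + gain @ cov_w @ gain.T
        out.append(proj)
    return np.array(out)
```

At a missing time point the gain matrix is zero, so that the filtered state coincides with the one-step prediction, as it should.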
* 9.1.2 Kalman Smoothing

Besides predicting future states or outputs we may be interested in reconstructing the
complete state sequence $X_0, X_1, \ldots, X_n$ from the outputs $Y_1, \ldots, Y_n$. The computation of
$\Pi_n X_n$ is known as the filtering problem, and is step (2) of our description of the Kalman
filter. The computation of $\Pi_n X_t$ for $t = 0, 1, \ldots, n-1$ is known as the smoothing problem.
For a given $t$ it can be achieved through the recursions, with $\tilde W_n$ as given in (9.3),

$$\begin{pmatrix} \Pi_n X_t \\ \mathrm{Cov}(X_t, \tilde W_n) \\ \mathrm{Cov}(X_t, X_n - \Pi_{n-1} X_n) \end{pmatrix}
\to
\begin{pmatrix} \Pi_{n+1} X_t \\ \mathrm{Cov}(X_t, \tilde W_{n+1}) \\ \mathrm{Cov}(X_t, X_{n+1} - \Pi_n X_{n+1}) \end{pmatrix},
\qquad n = t, t+1, \ldots.$$

The initial value at $n = t$ of the recursions and the covariance matrices $\mathrm{Cov}(\tilde W_n)$ of the
innovations $\tilde W_n$ are given by (9.5)-(9.6), and hence can be assumed known.

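Because $H_{n+1}$ is the sum of the orthogonal spaces $H_n$ and $\mathrm{lin}\,\tilde W_{n+1}$, we have, as in
(9.4),

$$\Pi_{n+1} X_t = \Pi_n X_t + \Lambda_{t,n+1} \tilde W_{n+1}, \qquad \Lambda_{t,n+1} = \mathrm{Cov}(X_t, \tilde W_{n+1})\, \mathrm{Cov}(\tilde W_{n+1})^{-1}.$$

The recursion for the first coordinate $\Pi_n X_t$ follows from this and the recursions
for the second and third coordinates, the covariance matrices $\mathrm{Cov}(X_t, \tilde W_{n+1})$ and
$\mathrm{Cov}(X_t, X_{n+1} - \Pi_n X_{n+1})$.

Using in turn the state equation and equation (9.4), we find

$$\begin{aligned}
\mathrm{Cov}(X_t, X_{n+1} - \Pi_n X_{n+1}) &= \mathrm{Cov}\bigl(X_t, F_{n+1}(X_n - \Pi_n X_n) + V_{n+1}\bigr) \\
&= \mathrm{Cov}\bigl(X_t, F_{n+1}(X_n - \Pi_{n-1} X_n - \Lambda_n \tilde W_n)\bigr),
\end{aligned}$$

where $V_{n+1}$ drops out because it is uncorrelated with $X_t$, and $\Pi_n X_n = \Pi_{n-1} X_n + \Lambda_n \tilde W_n$
by (9.4). This readily gives the recursion for the third component, the matrix $\Lambda_n$ being known
from (9.4)-(9.5). Next using equation (9.3), we find

$$\mathrm{Cov}(X_t, \tilde W_{n+1}) = \mathrm{Cov}(X_t, X_{n+1} - \Pi_n X_{n+1})\, G_{n+1}^T.$$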
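
The smoothing recursion can be run alongside the filter. The sketch below is a hedged illustration for a time-invariant model with $\mathrm{Cov}(V_t, W_t) = 0$ and $H_0 = \mathrm{lin}\{1\}$; the function name and arguments are assumptions made for this example. The variables `C` and `D` hold the third and second coordinates of the recursion, and the update of `C` uses $X_n - \Pi_n X_n = X_n - \Pi_{n-1} X_n - \Lambda_n \tilde W_n$.

```python
import numpy as np

def smooth_fixed_point(ys, F, G, Q, R, m0, P0, t):
    """Return Pi_N X_t, where N = len(ys) and 1 <= t <= N.
    m0 = E X_0, P0 = Cov(X_0); time-invariant model with Cov(V_t, W_t) = 0."""
    proj, cov_proj, cov_x = m0, np.zeros_like(P0), P0
    for n, y in enumerate(ys, start=1):
        # filter steps (1) and (2) at time n
        pred, cov_pred, cov_x = F @ proj, F @ cov_proj @ F.T, F @ cov_x @ F.T + Q
        err = cov_x - cov_pred                   # Cov(X_n - Pi_{n-1} X_n)
        cov_w = G @ err @ G.T + R                # Cov of the innovation at time n
        cov_w_inv = np.linalg.inv(cov_w)
        gain = err @ G.T @ cov_w_inv             # Lambda_n
        innov = y - G @ pred
        proj = pred + gain @ innov
        cov_proj = cov_pred + gain @ cov_w @ gain.T
        if n == t:
            smooth = proj                        # Pi_t X_t
            C = err                              # Cov(X_t, X_t - Pi_{t-1} X_t)
            D = err @ G.T                        # Cov(X_t, innovation at time t)
        elif n > t:
            C = (C - D @ prev_gain.T) @ F.T      # third coordinate
            D = C @ G.T                          # second coordinate
            smooth = smooth + D @ cov_w_inv @ innov   # first coordinate
        prev_gain = gain
    return smooth
```

For a zero-mean model the result can be checked against the direct formula $\mathrm{Cov}(X_t, Y)\,\mathrm{Cov}(Y)^{-1} Y$ for the projection onto all outputs at once.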

* 9.1.3 Lagged Correlations 

In the preceding we have assumed that the vectors $X_0, (V_1, W_1), (V_2, W_2), \ldots$ are
uncorrelated. An alternative assumption is that the vectors $X_0, V_1, (W_1, V_2), (W_2, V_3), \ldots$ are
uncorrelated. (The awkward pairing of $W_t$ and $V_{t+1}$ can be avoided by writing the state
equation as $X_t = F_t X_{t-1} + V_{t-1}$ and next making the assumption as before.) Under
this condition the Kalman filter takes a slightly different form, where for economy of
computation it can be useful to combine the steps (1) and (2).
Both possibilities are covered by the assumptions that

- the vectors $X_0, V_1, V_2, \ldots$ are orthogonal.

- the vectors $W_1, W_2, \ldots$ are orthogonal.

- the vectors $V_s$ and $W_t$ are orthogonal for all $(s, t)$ except possibly $s = t$ or $s = t + 1$.

- all vectors are orthogonal to $H_0$.

Under these assumptions step (2) of the Kalman filter remains valid as described. Step
(1) must be adapted, because it is no longer true that $\Pi_{t-1} V_t = 0$.
Because $V_t \perp H_{t-2}$, we can compute $\Pi_{t-1} V_t$ from the innovation decomposition
$H_{t-1} = H_{t-2} + \mathrm{lin}\,\tilde W_{t-1}$, as $\Pi_{t-1} V_t = K_{t-1} \tilde W_{t-1}$ for the matrix

$$K_{t-1} = \mathrm{Cov}(V_t, \tilde W_{t-1})\, \mathrm{Cov}(\tilde W_{t-1})^{-1}.$$

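Note here that $\mathrm{Cov}(V_t, \tilde W_{t-1}) = \mathrm{Cov}(V_t, W_{t-1})$, in view of (9.3). We replace the
calculations for step (1) by

$$\begin{aligned}
\Pi_{t-1} X_t &= F_t (\Pi_{t-1} X_{t-1}) + K_{t-1} \tilde W_{t-1}, \\
\mathrm{Cov}(\Pi_{t-1} X_t) &= F_t\, \mathrm{Cov}(\Pi_{t-1} X_{t-1})\, F_t^T + F_t \Lambda_{t-1}\, \mathrm{Cov}(\tilde W_{t-1})\, K_{t-1}^T \\
&\qquad + K_{t-1}\, \mathrm{Cov}(\tilde W_{t-1})\, \Lambda_{t-1}^T F_t^T + K_{t-1}\, \mathrm{Cov}(\tilde W_{t-1})\, K_{t-1}^T, \\
\mathrm{Cov}(X_t) &= F_t\, \mathrm{Cov}(X_{t-1})\, F_t^T + \mathrm{Cov}(V_t).
\end{aligned}$$

The two cross terms in the second equation arise because
$\mathrm{Cov}(\Pi_{t-1} X_{t-1}, \tilde W_{t-1}) = \mathrm{Cov}(X_{t-1}, \tilde W_{t-1}) = \Lambda_{t-1}\, \mathrm{Cov}(\tilde W_{t-1})$, by (9.4).
This gives a complete description of step (1) of the algorithm, under the assumption that
the vector $\tilde W_{t-1}$, its covariance matrix and the matrix $\Lambda_{t-1}$ are kept in memory after the
preceding step (2).

The smoothing algorithm goes through as stated except for the recursion for the
matrices $\mathrm{Cov}(X_t, X_n - \Pi_{n-1} X_n)$. Because $\Pi_n V_{n+1} = K_n \tilde W_n$ may be nonzero, this becomes

$$\begin{aligned}
\mathrm{Cov}(X_t, X_{n+1} - \Pi_n X_{n+1}) &= \mathrm{Cov}(X_t, X_n - \Pi_{n-1} X_n)\, F_{n+1}^T - \mathrm{Cov}(X_t, \tilde W_n)\, \Lambda_n^T F_{n+1}^T \\
&\qquad - \mathrm{Cov}(X_t, \tilde W_n)\, K_n^T.
\end{aligned}$$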



9.2 Nonlinear Filtering 

The simplicity of the Kalman filter results both from the simplicity of the linear state 
space model and the fact that it concerns linear predictions. The principle of recursive 
predictions can be applied more generally to compute nonlinear predictions in nonlinear 
state space models, provided the conditional densities of the variables in the system are 
available and certain integrals involving these densities can be evaluated, analytically, 
numerically, or by stochastic simulation. 

Somewhat abusing notation we write a conditional density of a variable $X$ given another
variable $Y$ as $p(x \mid y)$, and a marginal density of $X$ as $p(x)$. Consider the nonlinear
state space model (9.1), where we assume that the vectors $X_0, V_1, W_1, V_2, \ldots$ are independent.
Then the outputs $Y_1, \ldots, Y_n$ are conditionally independent given the state sequence
$X_0, X_1, \ldots, X_n$, and the conditional law of a single output $Y_t$ given the state sequence
depends on $X_t$ only. In principle the (conditional) densities $p(x_0), p(x_1 \mid x_0), p(x_2 \mid x_1), \ldots$
and the conditional densities $p(y_t \mid x_t)$ of the outputs are available from the form of the
functions $f_t$ and $g_t$ and the distributions of the noise variables $(V_t, W_t)$. Under the
assumption of independent noise vectors the system is a hidden Markov model, and the
joint density of states up till time $n + 1$ and outputs up till time $n$ can be expressed in
these densities as

$$p(x_0)\, p(x_1 \mid x_0) \cdots p(x_{n+1} \mid x_n)\; p(y_1 \mid x_1)\, p(y_2 \mid x_2) \cdots p(y_n \mid x_n). \tag{9.7}$$

The marginal density of the outputs is obtained by integrating this function relative
to $(x_0, \ldots, x_{n+1})$. The conditional density of the state sequence $(X_0, \ldots, X_{n+1})$ given
the outputs is proportional to the function in the display, the norming constant being
the marginal density of the outputs. In principle, this allows the computation of all
conditional expectations $E(X_t \mid Y_1, \ldots, Y_n)$. However, because this is a quotient of
$(n+1)$-dimensional integrals and $n$ may be large, this is unattractive unless the integrals can be
evaluated easily.

An alternative for finding predictions is a recursive scheme for calculating conditional
densities, of the form

$$\cdots \to p(x_{t-1} \mid y_{t-1}, \ldots, y_1) \xrightarrow{(1)} p(x_t \mid y_{t-1}, \ldots, y_1) \xrightarrow{(2)} p(x_t \mid y_t, \ldots, y_1) \to \cdots.$$

This is completely analogous to the updates of the linear Kalman filter: the recursions
alternate between "updating the state", (1), and "updating the prediction space", (2).
Step (1) can be summarized by the formula

$$\begin{aligned}
p(x_t \mid y_{t-1}, \ldots, y_1) &= \int p(x_t \mid x_{t-1}, y_{t-1}, \ldots, y_1)\, p(x_{t-1} \mid y_{t-1}, \ldots, y_1)\, d\mu_{t-1}(x_{t-1}) \\
&= \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid y_{t-1}, \ldots, y_1)\, d\mu_{t-1}(x_{t-1}).
\end{aligned}$$

The second equality follows from the conditional independence of the vectors $X_t$ and
$Y_{t-1}, \ldots, Y_1$ given $X_{t-1}$. This is a consequence of the form of $X_t = f_t(X_{t-1}, V_t)$
and the independence of $V_t$ and the vectors $X_{t-1}, Y_{t-1}, \ldots, Y_1$ (which are functions of
$X_0, V_1, \ldots, V_{t-1}, W_1, \ldots, W_{t-1}$).

To obtain a recursion for step (2) we apply Bayes' formula to the conditional densities
given $Y_{t-1}, \ldots, Y_1$ to obtain

$$\begin{aligned}
p(x_t \mid y_t, \ldots, y_1) &= \frac{p(y_t \mid x_t, y_{t-1}, \ldots, y_1)\, p(x_t \mid y_{t-1}, \ldots, y_1)}{\int p(y_t \mid x_t, y_{t-1}, \ldots, y_1)\, p(x_t \mid y_{t-1}, \ldots, y_1)\, d\mu_t(x_t)} \\
&= \frac{p(y_t \mid x_t)\, p(x_t \mid y_{t-1}, \ldots, y_1)}{p(y_t \mid y_{t-1}, \ldots, y_1)}.
\end{aligned}$$

The second equation is a consequence of the fact that $Y_t = g_t(X_t, W_t)$ is conditionally
independent of $Y_{t-1}, \ldots, Y_1$ given $X_t$. The conditional density $p(y_t \mid y_{t-1}, \ldots, y_1)$ in the
denominator is a nuisance, because it will rarely be available explicitly, but acts only as
a norming constant.
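
The two steps are easiest to see when the state space is finite, so that the dominating measures are counting measures and the integrals become sums. The sketch below covers that special case only; the function names and the matrix conventions are assumptions made for illustration.

```python
import numpy as np

def predict_step(prior, trans):
    """Step (1): p(x_t | y_{t-1}, ..., y_1) from p(x_{t-1} | y_{t-1}, ..., y_1).
    prior[i] is the probability of state i; trans[i, j] = p(x_t = j | x_{t-1} = i)."""
    return prior @ trans

def update_step(pred, lik):
    """Step (2): Bayes' formula, with lik[j] = p(y_t | x_t = j).
    Returns p(x_t | y_t, ..., y_1) and the norming constant p(y_t | y_{t-1}, ..., y_1)."""
    joint = lik * pred
    norm = joint.sum()
    return joint / norm, norm
```

The product of the norming constants over $t = 1, \ldots, n$ is the marginal density of the outputs, so that this quantity comes for free when the recursion is run.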

The preceding formulas are useful only if the integrals can be evaluated. If analytical 
evaluation is impossible, then perhaps numerical methods or stochastic simulation could 
be of help. 

If stochastic simulation is the method of choice, then it may be attractive to apply
Markov Chain Monte Carlo for direct evaluation of the joint law, without recursions. The
idea is to simulate a sample from the conditional density $p(x_0, \ldots, x_{n+1} \mid y_1, \ldots, y_n)$ of the
states given the outputs. The biggest challenge is the dimensionality of this conditional
density. The Gibbs sampler overcomes this by simulating recursively from the marginal
conditional densities $p(x_t \mid x_{-t}, y_1, \ldots, y_n)$ of the single variables $X_t$ given the outputs
$Y_1, \ldots, Y_n$ and the vectors $X_{-t} = (X_0, \ldots, X_{t-1}, X_{t+1}, \ldots, X_{n+1})$ of remaining states.
We refer to the literature for general discussion of the Gibbs sampler, but shall show
that these marginal distributions are relatively easy to obtain for the general state space
model (9.1).

Under independence of the vectors $X_0, V_1, W_1, V_2, \ldots$ the joint density of states and
outputs takes the hidden Markov form (9.7). The conditional density of $X_t$ given the
other vectors is proportional to this expression viewed as function of $x_t$ only. Only three
terms of the product depend on $x_t$ and hence we find

$$p(x_t \mid x_{-t}, y_1, \ldots, y_n) \propto p(x_{t+1} \mid x_t)\, p(x_t \mid x_{t-1})\, p(y_t \mid x_t).$$

The norming constant is a function of the conditioning variables $x_{-t}, y_1, \ldots, y_n$ only and
can be recovered from the fact that the left side is a probability density as a function
of $x_t$. A closer look will reveal that it is equal to $p(y_t \mid x_{t-1}, x_{t+1})\, p(x_{t+1} \mid x_{t-1})$. However,
many simulation methods, in particular the popular Metropolis-Hastings algorithm, can
be implemented without an explicit expression for the proportionality constant. The
forms of the three densities on the right side should follow from the specification of the
system.
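
For a finite state space the three-term product can be normalized by a simple sum, which gives a compact Gibbs sweep. The following sketch is an illustration under that assumption (and a time-homogeneous chain); the function names and the storage conventions for `trans` and `emit` are hypothetical.

```python
import numpy as np

def full_conditional(t, x, y_t, trans, emit):
    """p(x_t | x_{-t}, y_1, ..., y_n) on a finite state space, for 1 <= t <= n.
    x = (x_0, ..., x_{n+1}) holds the current states; trans[i, j] = p(x_{s+1} = j | x_s = i);
    emit[j, k] = p(y_s = k | x_s = j). Only three factors of (9.7) involve x_t."""
    w = trans[x[t - 1], :] * trans[:, x[t + 1]] * emit[:, y_t]
    return w / w.sum()

def gibbs_sweep(x, y, trans, emit, rng):
    """One Gibbs sweep over x_1, ..., x_n; for brevity x_0 and x_{n+1} are kept fixed.
    y = (y_1, ..., y_n) is stored as y[0], ..., y[n-1]."""
    x = x.copy()
    for t in range(1, len(y) + 1):
        probs = full_conditional(t, x, y[t - 1], trans, emit)
        x[t] = rng.choice(trans.shape[0], p=probs)
    return x
```

In a continuous state space the normalization is usually not available in closed form, and a Metropolis-Hastings step based on the unnormalized product would take the place of the exact draw.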

The assumption that the variables $X_0, V_1, W_1, V_2, \ldots$ are independent may be too
restrictive, although it is natural to try and construct the state variables so that it is
satisfied. Somewhat more complicated formulas can be obtained under more general
assumptions. Assumptions that are in the spirit of the preceding derivations in this
chapter are:
(i) the vectors $X_0, X_1, X_2, \ldots$ form a Markov chain.

(ii) the vectors $Y_1, \ldots, Y_n$ are conditionally independent given $X_0, X_1, \ldots, X_{n+1}$.
(iii) for each $t \in \{1, \ldots, n\}$ the vector $Y_t$ is conditionally independent of the vector

$(X_0, \ldots, X_{t-2}, X_{t+2}, \ldots, X_{n+1})$ given $(X_{t-1}, X_t, X_{t+1})$.
The first assumption is true if the vectors $X_0, V_1, V_2, \ldots$ are independent. The second
and third assumptions are certainly satisfied if all noise vectors $X_0, V_1, W_1, V_2, W_2, V_3, \ldots$
are independent. The exercises below give more general sufficient conditions for (i)-(iii)
in terms of the noise variables.

In comparison to the hidden Markov situation considered previously not much
changes. The joint density of states and outputs can be written in a product form similar
to (9.7), the difference being that each conditional density $p(y_t \mid x_t)$ must be replaced by
$p(y_t \mid x_{t-1}, x_t, x_{t+1})$. The variable $x_t$ then occurs in five terms of the product and hence
we obtain

$$\begin{aligned}
p(x_t \mid x_{-t}, y_1, \ldots, y_n) \propto{}& p(x_{t+1} \mid x_t)\, p(x_t \mid x_{t-1}) \\
&\times p(y_{t-1} \mid x_{t-2}, x_{t-1}, x_t)\, p(y_t \mid x_{t-1}, x_t, x_{t+1})\, p(y_{t+1} \mid x_t, x_{t+1}, x_{t+2}).
\end{aligned}$$

This formula is general enough to cover the case of the ARV model discussed in the next
section.

9.6 EXERCISE. Suppose that $X_0, V_1, W_1, V_2, W_2, V_3, \ldots$ are independent, and define
states $X_t$ and outputs $Y_t$ by (9.1). Show that (i)-(iii) hold, where in (iii) the vector
$Y_t$ is even conditionally independent of $(X_s : s \ne t)$ given $X_t$.



* 9.7 EXERCISE. Suppose that $X_0, V_1, V_2, \ldots, Z_1, Z_2, \ldots$ are independent, and define
states $X_t$ and outputs $Y_t$ through (9.2) with $W_t = h_t(V_t, V_{t+1}, Z_t)$ for measurable functions
$h_t$. Show that (i)-(iii) hold. [Under (9.2) there exists a measurable bijection between
the vectors $(X_0, V_1, \ldots, V_t)$ and $(X_0, X_1, \ldots, X_t)$, and also between the vectors
$(X_t, X_{t-1}, X_{t+1})$ and $(X_t, V_t, V_{t+1})$. Thus conditioning on $(X_0, X_1, \ldots, X_{n+1})$ is the same
as conditioning on $(X_0, V_1, \ldots, V_{n+1})$ or on $(X_0, V_1, \ldots, V_n, X_{t-1}, X_t, X_{t+1})$.]

9.8 EXERCISE. Show that the condition in the preceding exercise that $W_t =
h_t(V_t, V_{t+1}, Z_t)$ for $Z_t$ independent of the other variables is equivalent to the conditional
independence of $W_t$ and $X_0, V_1, \ldots, V_n, W_s : s \ne t$ given $V_t, V_{t+1}$.



9.3 Stochastic Volatility Models 

The term "volatility", which we have used on multiple occasions to describe the "movability"
of a time series, appears to have its origins in the theory of option pricing. The
Black-Scholes model for pricing an option on a given asset with price $S_t$ is based on a
diffusion equation of the type

$$dS_t = \mu_t S_t\, dt + \sigma_t S_t\, dB_t.$$

Here $B_t$ is a Brownian motion process and $\mu_t$ and $\sigma_t$ are stochastic processes, which
are usually assumed to be adapted to the filtration generated by the process $S_t$. In the
original Black-Scholes model the process $\sigma_t$ is assumed constant, and the constant is
known as the "volatility" of the process $S_t$.

The Black-Scholes diffusion equation can also be written in the form

$$\log \frac{S_t}{S_0} = \int_0^t \bigl(\mu_s - \tfrac12 \sigma_s^2\bigr)\, ds + \int_0^t \sigma_s\, dB_s.$$

If $\mu$ and $\sigma$ are deterministic processes this shows that the log returns $\log S_t/S_{t-1}$ over
the intervals $(t-1, t]$ are independent, normally distributed variables ($t = 1, 2, \ldots$) with
means $\int_{t-1}^t (\mu_s - \tfrac12 \sigma_s^2)\, ds$ and variances $\int_{t-1}^t \sigma_s^2\, ds$. In other words, if these means and
variances are denoted by $\bar\mu_t$ and $\bar\sigma_t^2$, then the variables

$$Z_t = \frac{\log S_t/S_{t-1} - \bar\mu_t}{\bar\sigma_t}$$

are an i.i.d. sample from the standard normal distribution. The standard deviation $\bar\sigma_t$
can be viewed as an "average volatility" over the interval $(t-1, t]$. If the processes $\mu_t$
and $\sigma_t$ are not deterministic, then the process $Z_t$ is not necessarily Gaussian. However, if
the unit of time is small, so that the intervals $(t-1, t]$ correspond to small time intervals
in real time, then it is considered still believable that the variables $Z_t$ are approximately
normally distributed. In that case it is also believable that the processes $\mu_t$ and $\sigma_t$ are
approximately constant over each interval, so that $\mu_t - \tfrac12 \sigma_t^2$ and $\sigma_t$ can replace the
averages $\bar\mu_t$ and $\bar\sigma_t$. Usually, one even assumes that the mean $\bar\mu_t$ is constant in time. For
simplicity of notation we shall take it to be zero in the following, leading to a model of the form
$\log S_t/S_{t-1} = \sigma_t Z_t$, for standard normal variables $Z_t$ and a "volatility" process $\sigma_t$.

There is ample empirical evidence that models with constant volatility do not fit 
observed financial time series. In particular, this has been documented through a compar- 
ison of the option prices predicted by the Black-Scholes formula to the observed prices 
on the option market. Because the Black-Scholes price of an option on a given asset 
depends only on the volatility parameter of the asset price process, a single parameter
volatility model would allow this parameter to be calculated from the observed price of an
option on this asset, by inversion of the Black-Scholes formula. Given a range of options
with different maturities (and/or different strike prices) written on a given asset this in- 
version process usually leads to a range of "volatilities" , all connected to the same asset 
price process. These implied volatilities usually vary with the maturity and strike price. 
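
The inversion is a one-dimensional root-finding problem. The sketch below uses the standard Black-Scholes price of a European call on an asset without dividends, which is not stated in these notes; the interest rate `r`, the bisection bounds and all function names are assumptions made for this illustration.

```python
import math

def bs_call(S, K, T, r, sigma):
    """Black-Scholes price of a European call: spot S, strike K, maturity T,
    interest rate r, volatility sigma (no dividends)."""
    d1 = (math.log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal cdf
    return S * Phi(d1) - K * math.exp(-r * T) * Phi(d2)

def implied_vol(price, S, K, T, r, lo=1e-6, hi=5.0, tol=1e-10):
    """Invert bs_call in sigma by bisection; the call price is increasing in sigma."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bs_call(S, K, T, r, mid) < price:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Applying `implied_vol` to observed prices of options with different strikes and maturities on the same asset gives the range of implied volatilities referred to above.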

This discrepancy could be taken as proof of the failure of the reasoning behind the
Black-Scholes formula, but the more common explanation is that "volatility" is a random
process itself. One possible model is a diffusion equation of the type

$$d\sigma_t = \lambda_t \sigma_t\, dt + \gamma_t \sigma_t\, dW_t,$$

where $W_t$ is another Brownian motion process. This leads to a "stochastic volatility
model in continuous time". Many different parametric forms for the processes $\lambda_t$ and
$\gamma_t$ are suggested in the literature. One particular choice is to assume that $\log \sigma_t$ is an
Ornstein-Uhlenbeck process, i.e. it satisfies

$$d \log \sigma_t = \lambda(\xi - \log \sigma_t)\, dt + \gamma\, dW_t.$$

(An application of Itô's formula shows that this corresponds to the choices $\lambda_t = \tfrac12 \gamma^2 +
\lambda(\xi - \log \sigma_t)$ and $\gamma_t = \gamma$.) The Brownian motions $B_t$ and $W_t$ are often assumed to be
dependent, with quadratic variation $\langle B, W \rangle_t = \delta t$ for some parameter $\delta \le 0$.

A diffusion equation is a stochastic differential equation in continuous time, and does 
not fit well into our basic set-up, which considers the time variable t to be integer-valued. 
One approach would be to use continuous time models, but assume that the continuous 
time processes are observed only at a grid of time points. In view of the importance
of the option-pricing paradigm in finance it has also proved useful to give a definition of
"volatility" directly through discrete time models. These models are usually motivated
by an analogy with the continuous time set-up. "Stochastic volatility models" in discrete 
time are specifically meant to parallel continuous time diffusion models. 

The most popular stochastic volatility model in discrete time is the auto-regressive
random variance model or ARV model. A discrete time analogue for the Ornstein-Uhlenbeck
type volatility process $\sigma_t$ is the specification

$$\log \sigma_t = \alpha + \phi \log \sigma_{t-1} + V_{t-1}. \tag{9.8}$$

For $|\phi| < 1$ and a white noise process $V_t$ this auto-regressive equation possesses a causal
stationary solution $\log \sigma_t$. We select this solution in the following. The observed log
return process $X_t$ is modelled as

$$X_t = \sigma_t Z_t, \tag{9.9}$$

where it is assumed that the time series $(V_t, Z_t)$ is i.i.d., thus allowing for dependence
between $V_t$ and $Z_t$. The volatility process $\sigma_t$ is not observed.

A dependence between $V_t$ and $Z_t$ allows for a leverage effect, one of the "stylized
facts" of financial time series. In particular, if $V_t$ and $Z_t$ are negatively correlated, then
a small return $X_t$, which is indicative of a small value of $Z_t$, suggests a large value of $V_t$,
and hence a large value of the log volatility $\log \sigma_{t+1}$ at the next time instant. (Note that
the time index $t - 1$ of $V_{t-1}$ in the auto-regressive equation (9.8) is unusual, because in
other situations we would have written $V_t$.)
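
To get a feeling for the model it helps to simulate it. The sketch below generates a path from (9.8)-(9.9) under the additional assumption, made here only for illustration, that $(V_t, Z_t)$ is bivariate normal with $\mathrm{Var}\, V_t = \sigma^2$, $\mathrm{Var}\, Z_t = 1$ and correlation $\delta$. The function name, the burn-in length and the start at the stationary mean are arbitrary choices.

```python
import numpy as np

def simulate_arv(n, alpha, phi, sigma, delta, rng, burn=500):
    """Simulate X_1, ..., X_n and the volatilities sigma_1, ..., sigma_n from (9.8)-(9.9).
    (V_t, Z_t) is i.i.d. bivariate normal with sd(V_t) = sigma and corr(V_t, Z_t) = delta."""
    m = n + burn
    z = rng.standard_normal(m)
    v = sigma * (delta * z + np.sqrt(1.0 - delta ** 2) * rng.standard_normal(m))
    log_vol = np.empty(m)
    log_vol[0] = alpha / (1.0 - phi)                           # stationary mean as start
    for t in range(1, m):
        log_vol[t] = alpha + phi * log_vol[t - 1] + v[t - 1]   # equation (9.8)
    x = np.exp(log_vol) * z                                    # equation (9.9)
    return x[burn:], np.exp(log_vol[burn:])
```

With a negative `delta` a negative shock $Z_t$ tends to be followed by a higher volatility at time $t + 1$, which is the leverage effect described above.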

An ARV stochastic volatility process is a nonlinear state space model. It induces a
linear state space model for the log volatilities and log absolute log returns of the form

$$\begin{aligned}
\log \sigma_t &= (\alpha \;\; \phi) \begin{pmatrix} 1 \\ \log \sigma_{t-1} \end{pmatrix} + V_{t-1}, \\
\log |X_t| &= \log \sigma_t + \log |Z_t|.
\end{aligned}$$

In order to take the logarithm of the observed series $X_t$ it was necessary to take the
absolute value $|X_t|$ first. Usually this is not a serious loss of information, because the
sign of $X_t$ is equal to the sign of $Z_t$, and this is a Bernoulli $\tfrac12$ series if $Z_t$ is symmetrically
distributed.

The linear state space form allows the application of the Kalman filter to compute
best linear projections of the unobserved log volatilities $\log \sigma_t$ based on the observed log
absolute log returns $\log |X_t|$. Although this approach is computationally attractive, a
disadvantage is that the best predictions of the volatilities $\sigma_t$ based on the log returns
$X_t$ may be much better than the exponentials of the best linear predictions of the log
volatilities $\log \sigma_t$ based on the log returns. Forcing the model in linear form is not entirely
natural here. However, the computation of best nonlinear predictions is involved. Markov
Chain Monte Carlo methods are perhaps the most promising technique, but are highly
computer-intensive.

An ARV process $X_t$ is a martingale difference series relative to its natural filtration
$\mathcal{F}_t = \sigma(X_t, X_{t-1}, \ldots)$. To see this we first note that by causality $\sigma_t \in \sigma(V_{t-1}, V_{t-2}, \ldots)$,
whence $\mathcal{F}_t$ is contained in the filtration $\mathcal{G}_t = \sigma(V_s, Z_s : s \le t)$. The process $X_t$ is actually
already a martingale difference relative to this bigger filtration, because by the assumed
independence of $Z_t$ from $\mathcal{G}_{t-1}$

$$E(X_t \mid \mathcal{G}_{t-1}) = \sigma_t\, E(Z_t \mid \mathcal{G}_{t-1}) = 0.$$

A fortiori the process $X_t$ is a martingale difference series relative to the filtration $\mathcal{F}_t$.

There is no correspondingly simple expression for the conditional variance process
$E(X_t^2 \mid \mathcal{F}_{t-1})$ of an ARV series. By the same argument

$$E(X_t^2 \mid \mathcal{G}_{t-1}) = \sigma_t^2\, E Z_t^2.$$

If $E Z_t^2 = 1$ it follows that $E(X_t^2 \mid \mathcal{F}_{t-1}) = E(\sigma_t^2 \mid \mathcal{F}_{t-1})$, but this is intractable for further
evaluation. In particular, the process $\sigma_t^2$ is not the conditional variance process, unlike
in the situation of a GARCH process. Correspondingly, in the present context, in which
$\sigma_t$ is considered the "volatility", the volatility and conditional variance processes do not
coincide.

9.9 EXERCISE. One definition of a volatility process $\sigma_t$ of a time series $X_t$ is a process
$\sigma_t$ such that $X_t/\sigma_t$ is an i.i.d. standard normal series. Suppose that $X_t = \sigma_t Z_t$ is a
GARCH process with conditional variance process $\sigma_t^2$ and driven by an i.i.d. process $Z_t$.
If $Z_t$ is standard normal, show that $\sigma_t$ qualifies as a volatility process. [Trivial.] If $Z_t$
is a $t_p$-process show that there exists a process $S_t^2$ with a chisquare distribution with $p$
degrees of freedom such that $\sqrt{p}\, \sigma_t / S_t$ qualifies as a volatility process.

9.10 EXERCISE. In the ARV model is $\sigma_t$ measurable relative to the $\sigma$-field generated
by $X_{t-1}, X_{t-2}, \ldots$? Compare with GARCH models.

In view of the analogy with continuous time diffusion processes the assumption
that the variables $(V_t, Z_t)$ in (9.8)-(9.9) are normally distributed could be natural. This
assumption certainly helps to compute moments of the series. The stationary solution
$\log \sigma_t$ of the auto-regressive equation (9.8) is given by (for $|\phi| < 1$)

$$\log \sigma_t = \sum_{j=0}^\infty \phi^j (V_{t-1-j} + \alpha) = \sum_{j=0}^\infty \phi^j V_{t-1-j} + \frac{\alpha}{1 - \phi}.$$

If the time series $V_t$ is i.i.d. Gaussian with mean zero and variance $\sigma^2$, then it follows that
the variable $\log \sigma_t$ is normally distributed with mean $\alpha/(1 - \phi)$ and variance $\sigma^2/(1 - \phi^2)$.
The Laplace transform $E \exp(aZ)$ of a standard normal variable $Z$ is given by $\exp(\tfrac12 a^2)$.
Therefore, under the normality assumption on the process $V_t$ it is straightforward to
compute that, for $p > 0$,

$$E|X_t|^p = E e^{p \log \sigma_t}\, E|Z_t|^p = \exp\Bigl(\frac12 \frac{\sigma^2 p^2}{1 - \phi^2} + \frac{\alpha p}{1 - \phi}\Bigr)\, E|Z_t|^p.$$

Consequently, the kurtosis of the variables $X_t$ can be computed to be

$$\kappa_4(X) = e^{4\sigma^2/(1 - \phi^2)}\, \kappa_4(Z).$$

It follows that the time series $X_t$ possesses a larger kurtosis than the series $Z_t$. This is
true even for $\phi = 0$, but the effect is more pronounced for values of $\phi$ that are close
to 1, which are commonly found in practice. Thus the ARV model is able to explain
leptokurtic tails of an observed time series.
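
The moment formula and the resulting kurtosis are easy to evaluate numerically. The short sketch below is only an illustration; the function names are hypothetical, and the moment $E|Z_t|^p$ of the driving noise has to be supplied by the user.

```python
import math

def arv_abs_moment(p, alpha, phi, sigma, abs_moment_z):
    """E|X_t|^p = exp(sigma^2 p^2 / (2 (1 - phi^2)) + alpha p / (1 - phi)) * E|Z_t|^p,
    valid when V_t is Gaussian with variance sigma^2 and |phi| < 1."""
    exponent = 0.5 * sigma ** 2 * p ** 2 / (1 - phi ** 2) + alpha * p / (1 - phi)
    return math.exp(exponent) * abs_moment_z

def arv_kurtosis(phi, sigma, kurt_z=3.0):
    """kappa_4(X) = exp(4 sigma^2 / (1 - phi^2)) * kappa_4(Z); kurt_z = 3 for Gaussian Z_t."""
    return math.exp(4 * sigma ** 2 / (1 - phi ** 2)) * kurt_z
```

The kurtosis is the ratio of the fourth moment to the squared second moment, so the factor involving $\alpha$ cancels, which is why `arv_kurtosis` does not take `alpha` as an argument.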

For the computation of the auto-correlation function of the squared series $X_t^2$ we
assume that the variables $(V_t, Z_t)$ are bivariate normally distributed with correlation $\delta$.
Then the vectors $(\log \sigma_t, \log \sigma_{t+h}, Z_t)$ possess a three-dimensional normal distribution
with covariance matrix, for $h > 0$,

$$\begin{pmatrix}
\dfrac{\sigma^2}{1 - \phi^2} & \dfrac{\sigma^2 \phi^h}{1 - \phi^2} & 0 \\
\dfrac{\sigma^2 \phi^h}{1 - \phi^2} & \dfrac{\sigma^2}{1 - \phi^2} & \phi^{h-1} \delta \sigma \\
0 & \phi^{h-1} \delta \sigma & 1
\end{pmatrix}.$$

Some calculations show that the auto-correlation function of the square process is given
by

$$\rho_{X^2}(h) = \frac{(1 + 4\delta^2 \sigma^2 \phi^{2h-2})\, e^{4\sigma^2 \phi^h/(1 - \phi^2)} - 1}{3 e^{4\sigma^2/(1 - \phi^2)} - 1}, \qquad h > 0.$$

The auto-correlation is positive at positive lags and decreases exponentially fast to zero,
with a rate depending on the proximity of $\phi$ to 1. For values of $\phi$ close to 1, the decrease
is relatively slow.
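
A few numerical values make the rate of decay concrete. The function below simply evaluates the auto-correlation of the squares for Gaussian $(V_t, Z_t)$ with correlation $\delta$, with the denominator $3 e^{4\sigma^2/(1-\phi^2)} - 1$ corresponding to $\kappa_4(Z) = 3$; its name and argument order are assumptions made for illustration.

```python
import math

def arv_acf_squares(h, phi, sigma, delta):
    """Auto-correlation of X_t^2 at lag h > 0 in the Gaussian ARV model."""
    tau2 = sigma ** 2 / (1 - phi ** 2)             # variance of log sigma_t
    num = (1 + 4 * delta ** 2 * sigma ** 2 * phi ** (2 * h - 2)) * math.exp(4 * tau2 * phi ** h) - 1
    return num / (3 * math.exp(4 * tau2) - 1)
```

For instance, evaluating the function at lags 1, 5 and 20 with $\phi = 0.95$ shows the slow decay that is typical of squared financial returns.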

9.11 EXERCISE. Derive the formula for the auto-correlation function. 

9.12 EXERCISE. Suppose that the variables $V_t$ and $Z_t$ are independent for every $t$, in
addition to independence of the vectors $(V_t, Z_t)$, and assume that the variables $V_t$ (but
not necessarily the variables $Z_t$) are normally distributed. Show that

$$\rho_{X^2}(h) = \frac{e^{4\sigma^2 \phi^h/(1 - \phi^2)} - 1}{\kappa_4(Z)\, e^{4\sigma^2/(1 - \phi^2)} - 1}, \qquad h > 0.$$

[Factorize $E \sigma_{t+h}^2 \sigma_t^2 Z_{t+h}^2 Z_t^2$ as $E \sigma_{t+h}^2 \sigma_t^2\, E Z_{t+h}^2 Z_t^2$.]

The choice of the logarithmic function in the auto-regressive equation (9.8) is somewhat
arbitrary, and other possibilities, such as a power function, have been explored.



Moment and Least Squares Estimators



Suppose that we observe realizations $X_1, \ldots, X_n$ from a time series $X_t$ whose distribution
is (partly) described by a parameter $\theta \in \mathbb{R}^d$. For instance, an ARMA process
with the parameter $(\phi_1, \ldots, \phi_p, \theta_1, \ldots, \theta_q, \sigma^2)$, or a GARCH process with parameter
$(\alpha, \phi_1, \ldots, \phi_p, \theta_1, \ldots, \theta_q)$, both ranging over a subset of $\mathbb{R}^{p+q+1}$. In this chapter we
discuss two methods of estimation of the parameters, based on the observations $X_1, \ldots, X_n$:
the "method of moments" and the "least squares method".

When applied in the standard form to auto-regressive processes, the two methods
are essentially the same, but for other models the two methods may yield quite different 
estimators. Depending on the moments used and the underlying model, least squares 
estimators can be more efficient, although sometimes they are not usable at all. The 
"generalized method of moments" tries to bridge the efficiency gap, by increasing the 
number of moments employed. 

Moment and least squares estimators are popular in time series analysis, but in gen- 
eral they are less efficient than maximum likelihood and Bayes estimators. The difference 
in efficiency depends on the model and the true distribution of the time series. Maxi- 
mum likelihood estimation using a Gaussian model can be viewed as an extension of the 
method of least squares. We discuss the method of maximum likelihood in Chapter 12. 



10.1 Yule-Walker Estimators

Suppose that the time series $X_t - \mu$ is a stationary auto-regressive process of known
order $p$ and with unknown parameters $\phi_1, \ldots, \phi_p$ and $\sigma^2$. The mean $\mu = EX_t$ of the
series may also be unknown, but we assume that it is estimated by $\bar X_n$ and concentrate
our attention on estimating the remaining parameters.

From Chapter 7 we know that the parameters of an auto-regressive process are not
uniquely determined by the series $X_t$, but can be replaced by others if the white noise
process is changed appropriately as well. We shall aim at estimating the parameter under
the assumption that the series is causal. This is equivalent to requiring that all roots of
the polynomial $\phi(z) = 1 - \phi_1 z - \cdots - \phi_p z^p$ are outside the unit circle.

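Under causality the best linear predictor of $X_{p+1}$ based on $1, X_p, \ldots, X_1$ is given by
$\Pi_p X_{p+1} = \mu + \phi_1 (X_p - \mu) + \cdots + \phi_p (X_1 - \mu)$. (See Section 7.4.) Alternatively, the best
linear predictor can be obtained by solving the general prediction equations (2.1). This
shows that the parameters $\phi_1, \ldots, \phi_p$ satisfy

$$\begin{pmatrix}
\gamma_X(0) & \gamma_X(1) & \cdots & \gamma_X(p-1) \\
\gamma_X(1) & \gamma_X(0) & \cdots & \gamma_X(p-2) \\
\vdots & \vdots & \ddots & \vdots \\
\gamma_X(p-1) & \gamma_X(p-2) & \cdots & \gamma_X(0)
\end{pmatrix}
\begin{pmatrix} \phi_1 \\ \phi_2 \\ \vdots \\ \phi_p \end{pmatrix}
=
\begin{pmatrix} \gamma_X(1) \\ \gamma_X(2) \\ \vdots \\ \gamma_X(p) \end{pmatrix}.$$

We abbreviate this system of equations by $\Gamma_p \vec\phi_p = \vec\gamma_p$. These equations, which are known
as the Yule-Walker equations, express the parameters into second moments of the observations.
The Yule-Walker estimators are defined by replacing the true auto-covariances
$\gamma_X(h)$ by their sample versions $\hat\gamma_n(h)$ and next solving for $\phi_1, \ldots, \phi_p$. This leads to the
estimators

$$\hat{\vec\phi}_p =
\begin{pmatrix} \hat\phi_1 \\ \hat\phi_2 \\ \vdots \\ \hat\phi_p \end{pmatrix}
=
\begin{pmatrix}
\hat\gamma_n(0) & \hat\gamma_n(1) & \cdots & \hat\gamma_n(p-1) \\
\hat\gamma_n(1) & \hat\gamma_n(0) & \cdots & \hat\gamma_n(p-2) \\
\vdots & \vdots & \ddots & \vdots \\
\hat\gamma_n(p-1) & \hat\gamma_n(p-2) & \cdots & \hat\gamma_n(0)
\end{pmatrix}^{-1}
\begin{pmatrix} \hat\gamma_n(1) \\ \hat\gamma_n(2) \\ \vdots \\ \hat\gamma_n(p) \end{pmatrix}
=: \hat\Gamma_p^{-1} \hat{\vec\gamma}_p.$$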



The parameter $\sigma^2$ is by definition the variance of $Z_{p+1}$, which is the prediction error
$X_{p+1} - \Pi_p X_{p+1}$ when predicting $X_{p+1}$ by the preceding observations, under the assumption
that the time series is causal. By the orthogonality of the prediction error and the
predictor $\Pi_p X_{p+1}$ and Pythagoras' rule,

$$\sigma^2 = E(X_{p+1} - \mu)^2 - E(\Pi_p X_{p+1} - \mu)^2 = \gamma_X(0) - \vec\phi_p^T \Gamma_p \vec\phi_p. \tag{10.1}$$

We define an estimator $\hat\sigma^2$ by replacing all unknowns by their moment estimators, i.e.

$$\hat\sigma^2 = \hat\gamma_n(0) - \hat{\vec\phi}_p^T\, \hat\Gamma_p\, \hat{\vec\phi}_p.$$
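
The Yule-Walker estimators require nothing more than the sample auto-covariances and the solution of a linear system. The sketch below is an illustration, not a definitive implementation; it assumes the sample auto-covariance with divisor $n$, $\hat\gamma_n(h) = n^{-1} \sum_{t=1}^{n-h} (X_{t+h} - \bar X_n)(X_t - \bar X_n)$, and the function names are hypothetical.

```python
import numpy as np

def sample_autocov(x, max_lag):
    """Sample auto-covariances at lags 0, ..., max_lag, with divisor n."""
    x = np.asarray(x, dtype=float)
    n, d = len(x), x - np.mean(x)
    return np.array([d[h:] @ d[:n - h] / n for h in range(max_lag + 1)])

def yule_walker(x, p):
    """Solve the empirical Yule-Walker equations and evaluate the estimator of sigma^2."""
    g = sample_autocov(x, p)
    Gamma = np.array([[g[abs(i - j)] for j in range(p)] for i in range(p)])
    phi = np.linalg.solve(Gamma, g[1:])        # hat phi_1, ..., hat phi_p
    sigma2 = g[0] - phi @ Gamma @ phi          # empirical version of (10.1)
    return phi, sigma2
```

For $p = 1$ the estimator reduces to $\hat\phi_1 = \hat\gamma_n(1)/\hat\gamma_n(0)$, the sample auto-correlation at lag 1.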

10.1 EXERCISE. An alternative method to derive the Yule-Walker equations is to work
out the equations $\mathrm{cov}\bigl(\phi(B)(X_t - \mu), X_{t-k} - \mu\bigr) = \mathrm{cov}\bigl(Z_t, \sum_{j \ge 0} \psi_j Z_{t-j-k}\bigr)$ for $k =
0, \ldots, p$. Check this. Do you need causality? What if the time series would not be causal?

10.2 EXERCISE. Show that the matrix $\Gamma_p$ is invertible for every $p$. [Write $a^T \Gamma_p a$ in
terms of the spectral density.]

Another reasonable method to find estimators is to start from the fact that the true
values of $\phi_1, \ldots, \phi_p$ minimize the expectation

$$
(\beta_1, \ldots, \beta_p) \mapsto E\bigl(X_t - \mu - \beta_1(X_{t-1} - \mu) - \cdots - \beta_p(X_{t-p} - \mu)\bigr)^2.
$$



The least squares estimators are defined by replacing this criterion function by an
"empirical" (i.e. observable) version of it and next minimizing this. Let $\hat\phi_1, \ldots, \hat\phi_p$ minimize
the function

$$
(\beta_1, \ldots, \beta_p) \mapsto \frac{1}{n} \sum_{t=p+1}^{n} \bigl(X_t - \bar X_n - \beta_1(X_{t-1} - \bar X_n) - \cdots - \beta_p(X_{t-p} - \bar X_n)\bigr)^2.
$$

The minimum value itself is a reasonable estimator of the minimum value of the
expectation of this criterion function, which is $E Z_t^2 = \sigma^2$. The least squares estimators $\hat\phi_j$
obtained in this way are not identical to the Yule-Walker estimators, but the difference
is small. To see this, we rewrite the least squares estimators as the solution of a system
of equations. We begin by viewing them as the "ordinary" least squares estimators in
the linear regression model

$$
\begin{pmatrix} X_n - \bar X_n \\ X_{n-1} - \bar X_n \\ \vdots \\ X_{p+1} - \bar X_n \end{pmatrix}
=
\begin{pmatrix}
X_{n-1} - \bar X_n & X_{n-2} - \bar X_n & \cdots & X_{n-p} - \bar X_n \\
X_{n-2} - \bar X_n & X_{n-3} - \bar X_n & \cdots & X_{n-p-1} - \bar X_n \\
\vdots & \vdots & & \vdots \\
X_p - \bar X_n & X_{p-1} - \bar X_n & \cdots & X_1 - \bar X_n
\end{pmatrix}
\begin{pmatrix} \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}
+ e.
$$

This model differs from the ordinary linear regression model in that the design matrix, 
the (n — p) x p-matrix on the right, is random and depends on the dependent variable. 
However, the linear algebra to find the least squares estimators of $\beta_1, \ldots, \beta_p$ is the same.
If we abbreviate the preceding display as $\vec X_n - \bar X_n = D_n \vec\beta_p + e$, then the least squares
estimators are given by

$$
\hat{\vec\phi}_p = \Bigl(\frac{1}{n} D_n^T D_n\Bigr)^{-1} \frac{1}{n} D_n^T (\vec X_n - \bar X_n).
$$



At closer inspection this vector is nearly identical to the Yule-Walker estimators. Indeed,
for every $s, t$,

$$
\Bigl(\frac{1}{n} D_n^T D_n\Bigr)_{s,t} = \frac{1}{n} \sum_{j=p+1}^{n} (X_{j-s} - \bar X_n)(X_{j-t} - \bar X_n) \approx \hat\gamma_n(s - t),
$$

$$
\Bigl(\frac{1}{n} D_n^T (\vec X_n - \bar X_n)\Bigr)_{t} = \frac{1}{n} \sum_{j=p+1}^{n} (X_{j-t} - \bar X_n)(X_j - \bar X_n) \approx \hat\gamma_n(t).
$$

Asymptotically the difference between the Yule-Walker and least squares estimators is
negligible. They possess the same (normal) limit distribution.
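The closeness of the two estimators can be checked on simulated data. The sketch below is only an illustration under stated assumptions: it uses NumPy, Gaussian innovations, and the hypothetical helper names `simulate_ar`, `yule_walker` and `least_squares`. The least squares fit regresses the centred observations on the centred lagged observations, exactly as in the regression model above (the order of the rows is irrelevant for the solution).

```python
import numpy as np

def simulate_ar(phi, n, rng, burn_in=500):
    """Simulate a causal AR(p) series with standard normal innovations."""
    phi = np.asarray(phi, dtype=float)
    p = len(phi)
    z = rng.standard_normal(n + burn_in)
    x = np.zeros(n + burn_in)
    for t in range(n + burn_in):
        past = x[max(t - p, 0):t][::-1]
        x[t] = phi[:len(past)] @ past + z[t]
    return x[burn_in:]

def yule_walker(x, p):
    """Yule-Walker estimator based on sample auto-covariances (divisor n)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    g = np.array([xc[h:] @ xc[:n - h] / n for h in range(p + 1)])
    lags = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    return np.linalg.solve(g[lags], g[1:])

def least_squares(x, p):
    """Least squares estimator: regress X_j - mean on the p centred lags."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    # Column t holds the centred lag-t values X_{j-t} for j = p+1, ..., n.
    design = np.column_stack([xc[p - t:n - t] for t in range(1, p + 1)])
    beta, *_ = np.linalg.lstsq(design, xc[p:], rcond=None)
    return beta
```

On a series of a few thousand observations the two estimates typically differ far less than either differs from the true coefficients, in line with the statement that their difference is asymptotically negligible.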

10.3 Theorem. Let $(X_t - \mu)$ be a causal AR($p$) process relative to an i.i.d. sequence $Z_t$
with finite fourth moments. Then both the Yule-Walker and the least squares estimators
satisfy, with $\Gamma_p$ the covariance matrix of $(X_1, \ldots, X_p)$,

$$
\sqrt{n}\,(\hat{\vec\phi}_p - \vec\phi_p) \rightsquigarrow N(0, \sigma^2 \Gamma_p^{-1}).
$$






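Proof. We can assume without loss of generality that $\mu = 0$. The AR equations
$\phi(B) X_t = Z_t$ for $t = n, n-1, \ldots, p+1$ can be written in the matrix form

$$
\begin{pmatrix} X_n \\ X_{n-1} \\ \vdots \\ X_{p+1} \end{pmatrix}
=
\begin{pmatrix}
X_{n-1} & X_{n-2} & \cdots & X_{n-p} \\
X_{n-2} & X_{n-3} & \cdots & X_{n-p-1} \\
\vdots & \vdots & & \vdots \\
X_p & X_{p-1} & \cdots & X_1
\end{pmatrix}
\begin{pmatrix} \phi_1 \\ \vdots \\ \phi_p \end{pmatrix}
+
\begin{pmatrix} Z_n \\ Z_{n-1} \\ \vdots \\ Z_{p+1} \end{pmatrix}
= D_n \vec\phi_p + \tilde Z_n,
$$

for $\tilde Z_n$ the vector with coordinates $Z_t + \bar X_n \sum_i \phi_i$. We can solve $\vec\phi_p$ from this as

$$
\vec\phi_p = (D_n^T D_n)^{-1} D_n^T (\vec X_n - \tilde Z_n).
$$

Combining this with the analogous representation of the least squares estimators $\hat{\vec\phi}_p$ we
find, with $\vec Z_n = (Z_n, \ldots, Z_{p+1})^T$ and $\vec 1$ the vector with all coordinates equal to 1,

$$
\sqrt{n}\,(\hat{\vec\phi}_p - \vec\phi_p) = \Bigl(\frac{1}{n} D_n^T D_n\Bigr)^{-1} \frac{1}{\sqrt n} D_n^T \Bigl(\vec Z_n - \vec 1\, \bar X_n \bigl(1 - \sum_i \phi_i\bigr)\Bigr).
$$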

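Because $X_t$ is an auto-regressive process, it possesses a representation $X_t = \sum_j \psi_j Z_{t-j}$
for a sequence $\psi_j$ with $\sum_j |\psi_j| < \infty$. Therefore, the results of Chapter 7 apply and show
that $n^{-1} D_n^T D_n \to \Gamma_p$ in probability. In view of Slutsky's lemma it now suffices to show that

$$
\frac{1}{\sqrt n} D_n^T \vec Z_n \rightsquigarrow N(0, \sigma^2 \Gamma_p), \qquad
\frac{1}{\sqrt n} D_n^T \vec 1\, \bar X_n \to 0 \ \text{in probability}.
$$

A typical coordinate of the last vector is ($t = 1, \ldots, p$)

$$
\frac{1}{\sqrt n} \sum_{j=p+1}^{n} (X_{j-t} - \bar X_n) \bar X_n
= \frac{1}{\sqrt n} \sum_{j=p+1}^{n} X_{j-t}\, \bar X_n - \frac{n-p}{\sqrt n}\, \bar X_n^2.
$$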



In view of Theorem 4.5 and the assumption that $\mu = 0$, the sequence $\sqrt n\, \bar X_n$ converges
in distribution and hence both terms on the right side are of the order $O_P(1/\sqrt n)$.
A typical coordinate of the first vector is ($t = 1, \ldots, p$)

$$
\frac{1}{\sqrt n} \sum_{j=p+1}^{n} (X_{j-t} - \bar X_n) Z_j
= \frac{1}{\sqrt n} \sum_{j=1}^{n-p} Y_j + O_P(1/\sqrt n),
$$


for $Y_j = X_{p-t+j} Z_{p+j}$. By causality of the series $X_t$ we have $Z_{p+j} \perp X_{p-s+j}$ for $s > 0$
and hence $E Y_j = E X_{p-t+j}\, E Z_{p+j} = 0$ for every $j$. The same type of arguments as in
Chapter 5 will give us the asymptotic normality of the sequence $\sqrt n\, \bar Y_n$, with asymptotic
variance

$$
\sum_{g=-\infty}^{\infty} \gamma_Y(g) = \sum_{g=-\infty}^{\infty} E X_{p-t+g} Z_{p+g} X_{p-t} Z_p.
$$

In this series all terms with $g > 0$ vanish because, by the assumption of causality and the
fact that $Z_t$ is an i.i.d. sequence, $Z_{p+g}$ is independent of $(X_{p-t+g}, X_{p-t}, Z_p)$. All terms
with $g < 0$ vanish by symmetry. Thus the series is equal to $\gamma_Y(0) = E X_{p-t}^2 Z_p^2 = \gamma_X(0) \sigma^2$,
which is the diagonal element of $\sigma^2 \Gamma_p$. This concludes the proof of the convergence in
distribution of all marginals of $n^{-1/2} D_n^T \vec Z_n$. The joint convergence is proved in a similar
way, using the Cramér-Wold device.

This concludes the proof of the asymptotic normality of the least squares estimators.
The Yule-Walker estimators can be proved to be asymptotically equivalent to the
least squares estimators, in that the difference is of the order $o_P(1/\sqrt n)$. Next we apply
Slutsky's lemma. ■

10.4 EXERCISE. Show that the time series $Y_t$ in the preceding proof is strictly stationary.

10.5 EXERCISE. Give a precise proof of the asymptotic normality of $\sqrt n\, \bar Y_n$ as defined
in the preceding proof.



10.1.1 Order Selection 

In the preceding derivation of the least squares and Yule-Walker estimators the order $p$
of the AR process is assumed known a-priori. Theorem 10.3 is false if $X_t - \mu$ were in
reality an AR($p_0$) process of order $p_0 > p$. In that case $\hat\phi_1, \ldots, \hat\phi_p$ are estimators of the
coefficients of the best linear predictor based on $p$ observations, but need not converge
to the $p_0$ coefficients $\phi_1, \ldots, \phi_{p_0}$. On the other hand, Theorem 10.3 remains valid if the
series $X_t$ is an auto-regressive process of "true" order $p_0$ strictly smaller than the order
$p$ used to define the estimators. This follows because for $p_0 < p$ an AR($p_0$) process
is also an AR($p$) process, albeit that $\phi_{p_0+1}, \ldots, \phi_p$ are zero. Theorem 10.3 shows that
"overfitting" (choosing too big an order) does not cause great harm: if $\hat\phi_1^{(p)}, \ldots, \hat\phi_p^{(p)}$ are
the Yule-Walker estimators when fitting an AR($p$) model, then

$$
\sqrt n\, \hat\phi_j^{(p)} \rightsquigarrow N\bigl(0, \sigma^2 (\Gamma_p^{-1})_{j,j}\bigr), \qquad j = p_0 + 1, \ldots, p.
$$

It is comforting that the estimators of the "unnecessary" coefficients $\phi_{p_0+1}, \ldots, \phi_p$
converge to zero at rate $1/\sqrt n$. However, there is also a price to be paid by overfitting.
By Theorem 10.3, if fitting an AR($p$)-model, then the estimators of the first $p_0$ coefficients
satisfy

$$
\sqrt n \left( \begin{pmatrix} \hat\phi_1^{(p)} \\ \vdots \\ \hat\phi_{p_0}^{(p)} \end{pmatrix} - \begin{pmatrix} \phi_1 \\ \vdots \\ \phi_{p_0} \end{pmatrix} \right) \rightsquigarrow N\bigl(0, \sigma^2 (\Gamma_p^{-1})_{s,t=1,\ldots,p_0}\bigr).
$$

The covariance matrix in the right side, the $(p_0 \times p_0)$ upper principal submatrix of the
$(p \times p)$ matrix $\sigma^2 \Gamma_p^{-1}$, is not equal to $\sigma^2 \Gamma_{p_0}^{-1}$, which would have been the asymptotic covariance
matrix if we had fitted an AR model of the "correct" order $p_0$. In fact, it is bigger in
that

$$
(\Gamma_p^{-1})_{s,t=1,\ldots,p_0} - \Gamma_{p_0}^{-1} \ge 0.
$$

(Here we write $A \ge 0$ for a matrix $A$ if $A$ is nonnegative definite.) In particular, the diagonal
elements of these matrices, which are the differences of the asymptotic variances of the
estimators $\hat\phi_j^{(p)}$ and the estimators $\hat\phi_j^{(p_0)}$, are nonnegative. Thus overfitting leads to more
uncertainty in the estimators of both $\phi_1, \ldots, \phi_{p_0}$ and $\phi_{p_0+1}, \ldots, \phi_p$. Fitting an
auto-regressive process of very high order $p$ increases the chance of having the model fit well
to the data, but generally will result in poor estimates of the coefficients, which render
the final outcome less useful.

10.6 EXERCISE. Prove the assertion that the given matrix is nonnegative definite. 

In practice we do not know the correct order to use. A suitable order is often 
determined by a preliminary data analysis, such as an inspection of the plot of the
sample partial auto-correlation function. More formal methods are discussed within the 
general context of maximum likelihood estimation in Chapter 12. 

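10.7 Example. If we fit an AR(1) process to observations of an AR(1) series, then the
asymptotic covariance of $\sqrt n(\hat\phi_1 - \phi_1)$ is equal to $\sigma^2 \Gamma_1^{-1} = \sigma^2/\gamma_X(0)$. If to this same
process we fit an AR(2) process, then we obtain estimators $(\hat\phi_1^{(2)}, \hat\phi_2^{(2)})$ (not related to
the earlier $\hat\phi_1$) such that $\sqrt n(\hat\phi_1^{(2)} - \phi_1, \hat\phi_2^{(2)} - \phi_2)$ has asymptotic covariance matrix

$$
\sigma^2 \Gamma_2^{-1} = \sigma^2 \begin{pmatrix} \gamma_X(0) & \gamma_X(1) \\ \gamma_X(1) & \gamma_X(0) \end{pmatrix}^{-1}
= \frac{\sigma^2}{\gamma_X^2(0) - \gamma_X^2(1)} \begin{pmatrix} \gamma_X(0) & -\gamma_X(1) \\ -\gamma_X(1) & \gamma_X(0) \end{pmatrix}.
$$

Thus the asymptotic variance of the sequence $\sqrt n(\hat\phi_1^{(2)} - \phi_1)$ is equal to

$$
\frac{\sigma^2 \gamma_X(0)}{\gamma_X^2(0) - \gamma_X^2(1)} = \frac{\sigma^2}{\gamma_X(0)} \cdot \frac{1}{1 - \phi_1^2}.
$$

(Note that $\phi_1 = \gamma_X(1)/\gamma_X(0)$.) Thus overfitting by one degree leads to a loss in efficiency
of $1 - \phi_1^2$. This is particularly harmful if the true value of $|\phi_1|$ is close to 1, i.e. the time
series is close to being a (nonstationary) random walk. □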
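The two asymptotic variances of the example are easy to evaluate numerically. The sketch below (NumPy; the function name `asymptotic_variances` is an illustrative choice) uses the standard AR(1) identities $\gamma_X(0) = \sigma^2/(1 - \phi_1^2)$ and $\gamma_X(1) = \phi_1 \gamma_X(0)$ and compares $\sigma^2 \Gamma_1^{-1}$ with the upper-left element of $\sigma^2 \Gamma_2^{-1}$.

```python
import numpy as np

def asymptotic_variances(phi1, sigma2=1.0):
    """Asymptotic variances of sqrt(n)(phi1_hat - phi1) for AR(1) and AR(2) fits."""
    gamma0 = sigma2 / (1.0 - phi1 ** 2)      # variance of a causal AR(1) series
    gamma1 = phi1 * gamma0                   # lag-1 auto-covariance
    var_fit_ar1 = sigma2 / gamma0            # sigma^2 * Gamma_1^{-1}
    gamma_mat = np.array([[gamma0, gamma1],
                          [gamma1, gamma0]])
    var_fit_ar2 = sigma2 * np.linalg.inv(gamma_mat)[0, 0]
    return var_fit_ar1, var_fit_ar2

# The ratio of the two variances equals 1 - phi1^2, the efficiency loss.
v1, v2 = asymptotic_variances(0.9)
ratio = v1 / v2
```

With $\phi_1 = 0.9$ the ratio is $1 - 0.81 = 0.19$, so the overfitted estimator needs roughly five times as many observations for the same precision.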



10.1.2 Partial Auto-Correlations 

Recall that the partial auto-correlation coefficient $\alpha_X(h)$ is the coefficient of $X_1$ in the
formula $\beta_1 X_h + \cdots + \beta_h X_1$ for the best linear predictor of $X_{h+1}$ based on $X_1, \ldots, X_h$
(in the case that $\mu = 0$). In particular, for the causal AR($p$) process satisfying
$X_t = \phi_1 X_{t-1} + \cdots + \phi_p X_{t-p} + Z_t$ we have $\alpha_X(p) = \phi_p$ and $\alpha_X(h) = 0$ for $h > p$. The sample
partial auto-correlation coefficient is defined in Section 5.4 as the Yule-Walker estimator
$\hat\phi_h$ when fitting an AR($h$) model. This connection provides an alternative method to
derive the limit distribution in the special situation of auto-regressive processes. The 
simplicity of the result makes it worth the effort. 
10.8 Corollary. Let $X_t - \mu$ be a causal stationary AR($p$) process relative to an i.i.d.
sequence $Z_t$ with finite fourth moments. Then, for every $h > p$,

$$
\sqrt n\, \hat\alpha_n(h) \rightsquigarrow N(0, 1).
$$

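Proof. For $h > p$ the time series $X_t - \mu$ is also an AR($h$) process and hence we can
apply Theorem 10.3 to find that the Yule-Walker estimators $\hat\phi_1^{(h)}, \ldots, \hat\phi_h^{(h)}$ when fitting
an AR($h$) model satisfy

$$
\sqrt n\,(\hat\phi_h^{(h)} - \phi_h) \rightsquigarrow N\bigl(0, \sigma^2 (\Gamma_h^{-1})_{h,h}\bigr).
$$

The left side is exactly $\sqrt n\, \hat\alpha_n(h)$, because $\phi_h = 0$ for $h > p$. We show that the variance
on the right side is unity. By Cramer's rule the $(h,h)$-element of the matrix $\Gamma_h^{-1}$ can be
found as $\det \Gamma_{h-1} / \det \Gamma_h$. By the prediction equations we have for $h \ge p$

$$
\begin{pmatrix}
\gamma_X(0) & \gamma_X(1) & \cdots & \gamma_X(h-1) \\
\gamma_X(1) & \gamma_X(0) & \cdots & \gamma_X(h-2) \\
\vdots & \vdots & \ddots & \vdots \\
\gamma_X(h-1) & \gamma_X(h-2) & \cdots & \gamma_X(0)
\end{pmatrix}
\begin{pmatrix} \phi_1 \\ \vdots \\ \phi_p \\ 0 \\ \vdots \\ 0 \end{pmatrix}
=
\begin{pmatrix} \gamma_X(1) \\ \gamma_X(2) \\ \vdots \\ \gamma_X(h) \end{pmatrix}.
$$

This expresses the vector on the right as a linear combination of the first $p$ columns of the
matrix $\Gamma_h$ on the left. We can use this to rewrite $\det \Gamma_{h+1}$ (by a "sweeping" operation)
in the form

$$
\begin{vmatrix}
\gamma_X(0) & \gamma_X(1) & \cdots & \gamma_X(h) \\
\gamma_X(1) & \gamma_X(0) & \cdots & \gamma_X(h-1) \\
\vdots & \vdots & \ddots & \vdots \\
\gamma_X(h) & \gamma_X(h-1) & \cdots & \gamma_X(0)
\end{vmatrix}
=
\begin{vmatrix}
\gamma_X(0) - \phi_1 \gamma_X(1) - \cdots - \phi_p \gamma_X(p) & 0 & \cdots & 0 \\
\gamma_X(1) & \gamma_X(0) & \cdots & \gamma_X(h-1) \\
\vdots & \vdots & \ddots & \vdots \\
\gamma_X(h) & \gamma_X(h-1) & \cdots & \gamma_X(0)
\end{vmatrix}.
$$

The $(1,1)$-element in the last determinant is equal to $\sigma^2$ by (10.1). Thus this determinant
is equal to $\sigma^2 \det \Gamma_h$ and the theorem follows. ■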

This corollary is used often when choosing the order $p$ if fitting an auto-regressive
model to a given observed time series. The true partial auto-correlation coefficients of
lags higher than the true order $p$ are all zero. When we estimate these coefficients by the
sample partial auto-correlation coefficients, then we should expect that the estimates are inside
a band of the type $(-2/\sqrt n, 2/\sqrt n)$. Thus we should not choose the order equal to $p$ if
$\hat\alpha_n(p+k)$ is outside this band for too many $k \ge 1$. Here we should expect a fraction of 5
% of the $\hat\alpha_n(p+k)$ for which we perform this "test" to be outside the band in any case.

To turn this procedure into a more formal statistical test we must also take the
dependence between the different $\hat\alpha_n(p+k)$ into account, but this appears to be complicated.
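The band check can be sketched in a few lines. The code below is an illustration only, assuming NumPy and Gaussian innovations; `sample_pacf` and `simulate_ar` are hypothetical helper names. It computes the sample partial auto-correlation at lag $h$ as the last Yule-Walker coefficient of a fitted AR($h$) model, as in the definition recalled above, so that the values can be compared with $\pm 2/\sqrt n$.

```python
import numpy as np

def simulate_ar(phi, n, rng, burn_in=500):
    """Simulate a causal AR(p) series with standard normal innovations."""
    phi = np.asarray(phi, dtype=float)
    p = len(phi)
    z = rng.standard_normal(n + burn_in)
    x = np.zeros(n + burn_in)
    for t in range(n + burn_in):
        past = x[max(t - p, 0):t][::-1]
        x[t] = phi[:len(past)] @ past + z[t]
    return x[burn_in:]

def sample_pacf(x, max_lag):
    """Sample partial auto-correlations alpha_n(1), ..., alpha_n(max_lag)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    g = np.array([xc[h:] @ xc[:n - h] / n for h in range(max_lag + 1)])
    pacf = []
    for h in range(1, max_lag + 1):
        lags = np.abs(np.subtract.outer(np.arange(h), np.arange(h)))
        phi_h = np.linalg.solve(g[lags], g[1:h + 1])   # Yule-Walker fit of AR(h)
        pacf.append(phi_h[-1])                          # last coefficient
    return np.array(pacf)

def lags_outside_band(x, max_lag):
    """Lags h for which |alpha_n(h)| exceeds 2 / sqrt(n)."""
    pacf = sample_pacf(x, max_lag)
    band = 2.0 / np.sqrt(len(x))
    return [h + 1 for h, a in enumerate(pacf) if abs(a) > band]
```

For data from an AR(2) model one would expect lags 1 and 2 to be flagged, and only an occasional higher lag.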
* 10.9 EXERCISE. Find the asymptotic limit distribution of the sequence $(\hat\alpha_n(h), \hat\alpha_n(h+1))$
for $h > p$, e.g. in the case that $p = 0$ and $h = 1$.



* 10.1.3 Indirect Estimation 

The parameters $\phi_1, \ldots, \phi_p$ of a causal auto-regressive process are exactly the coefficients
of the one-step ahead linear predictor using $p$ variables from the past. This makes
application of the least squares method to obtain estimators for these parameters particularly
straightforward. For an arbitrary stationary time series the best linear predictor of $X_{p+1}$
given $1, X_1, \ldots, X_p$ is the linear combination $\mu + \phi_1(X_p - \mu) + \cdots + \phi_p(X_1 - \mu)$ whose
coefficients satisfy the prediction equations (2.1). The Yule-Walker estimators are the
solutions to these equations after replacing the true auto-covariances by the sample
auto-covariances. It follows that the Yule-Walker estimators can be considered estimators
for the prediction coefficients (using $p$ variables from the past) for any stationary
time series. The case of auto-regressive processes is special only in that these prediction
coefficients are exactly the parameters of the model.

Furthermore, it remains true that the Yule-Walker estimators are $\sqrt n$-consistent
and asymptotically normal. This does not follow from Theorem 10.3, because this uses
the auto-regressive structure explicitly, but it can be inferred from the asymptotic
normality of the auto-covariances, given in Theorem 5.7. (The argument is the same as
used in Section 5.4. The asymptotic covariance matrix will be different from the one in 
Theorem 10.3, and more complicated.) 

If the prediction coefficients (using a fixed number of past variables) are not the
parameters of main interest, then these remarks may seem of little use. However, if the
parameter of interest $\theta$ is of dimension $d$, then we may hope that there exists a
one-to-one relationship between $\theta$ and the prediction coefficients $\phi_1, \ldots, \phi_p$ if we choose $p = d$.
(More generally, we can apply this to a subvector of $\theta$ and a matching number of $\phi_j$'s.)
Then we can first estimate $\phi_1, \ldots, \phi_d$ by the Yule-Walker estimators and next employ
the relationship between $\theta$ and $\phi_1, \ldots, \phi_d$ to infer an estimate of $\theta$. If the inverse map giving
$\theta$ as a function of $\phi_1, \ldots, \phi_d$ is differentiable, then it follows by the Delta-method that
the resulting estimator for $\theta$ is $\sqrt n$-consistent and asymptotically normal, and hence we
obtain good estimators.

If the relationship between $\theta$ and $(\phi_1, \ldots, \phi_d)$ is complicated, then this idea may be
hard to implement. One way out of this problem is to determine the prediction coefficients
$\phi_1, \ldots, \phi_d$ for a grid of values of $\theta$, possibly through simulation. The value on the grid
whose prediction coefficients match the Yule-Walker estimators most closely is the estimator for $\theta$ we are looking for.
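The grid idea can be illustrated on a toy case. The sketch below is an assumption-laden illustration, not a method prescribed by the notes: it takes the model to be an MA(1) series $X_t = Z_t + \theta Z_{t-1}$ with $d = p = 1$, so that the single prediction coefficient is the lag-1 auto-correlation, and all function names are hypothetical. For each grid value of $\theta$ a long series is simulated (re-using the same innovations for every grid point, a design choice that makes the simulated map smooth in $\theta$), and the grid point whose simulated coefficient is closest to the Yule-Walker estimate from the data is returned.

```python
import numpy as np

def lag1_prediction_coefficient(x):
    """Yule-Walker estimator for p = 1: the sample lag-1 auto-correlation."""
    xc = np.asarray(x, dtype=float) - np.mean(x)
    return (xc[1:] @ xc[:-1]) / (xc @ xc)

def simulate_ma1(theta, z):
    """Toy model for illustration: X_t = Z_t + theta * Z_{t-1}."""
    return z[1:] + theta * z[:-1]

def indirect_estimate(data, grid, n_sim, rng):
    """Grid point whose simulated prediction coefficient matches the data best."""
    target = lag1_prediction_coefficient(data)
    z = rng.standard_normal(n_sim + 1)       # common innovations for all grid points
    simulated = np.array(
        [lag1_prediction_coefficient(simulate_ma1(theta, z)) for theta in grid]
    )
    return grid[np.argmin(np.abs(simulated - target))]
```

The grid should be restricted to a set on which the map from $\theta$ to the prediction coefficient is one-to-one, here for instance $(-1, 1)$.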

10.10 EXERCISE. Indicate how you could obtain (approximate) values for $\phi_1, \ldots, \phi_p$
given $\theta$ using computer simulation, for instance for a stochastic volatility model.
10.2 Moment Estimators

The Yule- Walker estimators can be viewed as arising from a comparison of sample auto- 
covariances to true auto-covariances and therefore are examples of moment estimators. 
Moment estimators are defined in general by matching sample moments and population 
moments. Population moments of a series $X_t$ are true expectations of functions of the
variables $X_t$, for instance,

$$
E_\theta X_t, \qquad E_\theta X_t^2, \qquad E_\theta X_{t+h} X_t, \qquad E_\theta X_{t+h}^2 X_t.
$$

In every case, the subscript $\theta$ indicates the dependence on the unknown parameter $\theta$:
in principle, every one of these moments is a function of $\theta$. The principle of the method
of moments is to estimate $\theta$ by that value $\hat\theta_n$ for which the corresponding population
moment coincides with a corresponding sample moment, for instance,

$$
\frac{1}{n} \sum_{t=1}^{n} X_t, \qquad \frac{1}{n} \sum_{t=1}^{n} X_t^2, \qquad \frac{1}{n} \sum_{t=1}^{n} X_{t+h} X_t, \qquad \frac{1}{n} \sum_{t=1}^{n} X_{t+h}^2 X_t.
$$

From Chapter 5 we know that these sample moments converge, as $n \to \infty$, to the true
moments, and hence it is believable that the sequence of moment estimators $\hat\theta_n$ also
converges to the true parameter, under some conditions.

Rather than true moments it is often convenient to define moment estimators
through derived moments such as an auto-covariance at a fixed lag, or an auto-correlation,
which are both functions of moments of degree at most 2. These derived moments
are then matched by the corresponding sample quantities.

The choice of moments to be used is crucial for the existence and consistency of the 
moment estimators, and also for their efficiency. 

For existence we shall generally need to match as many moments as there are
parameters in the model. If we use fewer moments, then we should expect that a moment
estimator is not uniquely defined, and we expect to find no solution to the moment
equations if we try to match too many moments. Because in general the moments
are highly nonlinear functions of the parameters, it is hard to make this statement pre- 
cise, as it is hard to characterize solutions of systems of nonlinear equations in general. 
This is illustrated already in the case of moving average processes, where a characteri- 
zation of the existence of solutions requires effort, and where conditions and restrictions 
are needed to ensure their uniqueness. (Cf. Section 10.2.1.) 

To ensure consistency and improve efficiency it is necessary to use moments that 
can be estimated well from the data. Thus auto-covariances at high lags, or moments of 
high degree should generally be avoided. Besides the quality of the initial estimates
of the population moments, the efficiency of the moment estimators depends also on the
inverse map giving the parameter as a function of the moments. To see this we may
formalize the method of moments through the scheme

$$
\begin{aligned}
\phi(\theta) &= E_\theta f(X_t, \ldots, X_{t+h}), \\
\phi(\hat\theta_n) &= \frac{1}{n} \sum_{t=1}^{n} f(X_t, \ldots, X_{t+h}).
\end{aligned}
$$
Here $f\colon \mathbb{R}^{h+1} \to \mathbb{R}^d$ is a given map, which defines the moments used. (For definiteness we
allow it to depend on the joint distribution of at most $h + 1$ consecutive observations.)
We assume that the time series $t \mapsto f(X_t, \ldots, X_{t+h})$ is strictly stationary, so that the
mean values $\phi(\theta)$ in the first line do not depend on $t$, and for simplicity of notation we
assume that we observe $X_1, \ldots, X_{n+h}$, so that the right side of the second line is indeed
an observable quantity. We shall assume that the map $\phi\colon \Theta \to \mathbb{R}^d$ is one-to-one, so that
the second line uniquely defines the estimator $\hat\theta_n$ as the inverse

$$
\hat\theta_n = \phi^{-1}(\hat f_n), \qquad \hat f_n = \frac{1}{n} \sum_{t=1}^{n} f(X_t, \ldots, X_{t+h}).
$$

We shall generally construct $\hat f_n$ such that it converges in probability to its mean $\phi(\theta)$
as $n \to \infty$. If this is the case and $\phi^{-1}$ is continuous at $\phi(\theta)$, then we have that
$\hat\theta_n \to \phi^{-1}\phi(\theta) = \theta$, in probability as $n \to \infty$, and hence the moment estimator is asymptotically
consistent.

Many sample moments converge at $\sqrt n$-rate, with a normal limit distribution. This
allows us to refine the consistency result, in view of the Delta-method, given by
Theorem 3.14. If $\phi^{-1}$ is differentiable at $\phi(\theta)$ and $\sqrt n(\hat f_n - \phi(\theta))$ converges in distribution to
a normal distribution with mean zero and covariance matrix $\Sigma_\theta$, then

$$
\sqrt n\,(\hat\theta_n - \theta) \rightsquigarrow N\bigl(0, \phi_\theta'^{-1} \Sigma_\theta (\phi_\theta'^{-1})^T\bigr).
$$

Here $\phi_\theta'^{-1}$ is the derivative of $\phi^{-1}$ at $\phi(\theta)$, which is the inverse of the derivative of $\phi$
at $\theta$, assumed to be nonsingular. We conclude that, under these conditions, the moment
estimators are $\sqrt n$-consistent with a normal limit distribution, a desirable property.

A closer look concerns the size of the asymptotic covariance matrix $\phi_\theta'^{-1} \Sigma_\theta (\phi_\theta'^{-1})^T$.
Clearly, it depends both on the accuracy by which the chosen moments can be estimated
from the data (through the matrix $\Sigma_\theta$) and the "smoothness" of the inverse $\phi^{-1}$. If the
inverse map has a "large" derivative, then extracting the moment estimator $\hat\theta_n$ from the
sample moments $\hat f_n$ magnifies the error of $\hat f_n$ as an estimate of $\phi(\theta)$, and the moment
estimator will be relatively inefficient. Unfortunately, it is hard to see how a particular 
implementation of the method of moments works out without doing (part of) the algebra 
leading to the asymptotic covariance matrix. Furthermore, the outcome may depend on 
the true value of the parameter, a given moment estimator being relatively efficient for 
some parameter values, but (very) inefficient for others. 

Moment estimators are measurable functions of the sample moments $\hat f_n$ and hence
cannot be better than the "best" estimator based on $\hat f_n$. In most cases summarizing the
data through the sample moments $\hat f_n$ incurs a loss of information. Only if the sample
moments are sufficient (in the statistical sense) , moment estimators can be fully efficient 
for estimating the parameters. This is an exceptional situation. The loss of information 
can be controlled somewhat by working with the right type of moments, but is usually 
unavoidable through the restriction of using only as many moments as there are param- 
eters. The reduction of a sample of size n to a "sample" of empirical moments of size d 
usually entails a loss of information. 

This observation motivates the generalized method of moments. The idea is to reduce 
the sample to more "empirical moments" than there are parameters. Given a function 
$f\colon \mathbb{R}^{h+1} \to \mathbb{R}^e$ for $e > d$ with corresponding mean function $\phi(\theta) = E_\theta f(X_t, \ldots, X_{t+h})$,
there is no hope, in general, to solve an estimator $\hat\theta_n$ from the system of equations
$\phi(\theta) = \hat f_n$, because these are $e > d$ equations in $d$ unknowns. The generalized method
of moments overcomes this by defining $\hat\theta_n$ as the minimizer of the quadratic form, for a
given matrix $V_n$,

$$
\theta \mapsto \bigl(\phi(\theta) - \hat f_n\bigr)^T V_n \bigl(\phi(\theta) - \hat f_n\bigr).
$$

Thus a generalized moment estimator tries to solve the system of equations $\phi(\theta) = \hat f_n$
as well as possible, where the discrepancy is measured through a certain quadratic form.
The matrix $V_n$ weighs the influence of the different components of $\hat f_n$ on the estimator $\hat\theta_n$,
and is typically chosen dependent on the data to increase the efficiency of the generalized
moment estimator. We assume that $V_n$ is symmetric and positive-definite.

As $n \to \infty$ the estimator $\hat f_n$ typically converges to its expectation under the true
parameter, which we shall denote by $\theta_0$ for clarity. If we replace $\hat f_n$ in the criterion
function by its expectation $\phi(\theta_0)$, then we can reduce the resulting quadratic form to
zero by choosing $\theta$ equal to $\theta_0$. This is clearly the minimal value of the quadratic form,
and the choice $\theta = \theta_0$ will be unique as soon as the map $\phi$ is one-to-one. This suggests
that the generalized moment estimator $\hat\theta_n$ is asymptotically consistent. As for ordinary
moment estimators, a rigorous justification of the consistency must take into account the
properties of the function $\phi$.

The distributional limit properties of a generalized moment estimator can be
understood by linearizing the function $\phi$ around the true parameter. Insertion of the first
order Taylor expansion $\phi(\theta) \approx \phi(\theta_0) + \phi'_{\theta_0}(\theta - \theta_0)$ into the quadratic form yields the
approximate criterion

$$
\begin{aligned}
\theta \mapsto{} & \bigl(\hat f_n - \phi(\theta_0) - \phi'_{\theta_0}(\theta - \theta_0)\bigr)^T V_n \bigl(\hat f_n - \phi(\theta_0) - \phi'_{\theta_0}(\theta - \theta_0)\bigr) \\
={} & \frac{1}{n} \bigl(Z_n - \phi'_{\theta_0} \sqrt n(\theta - \theta_0)\bigr)^T V_n \bigl(Z_n - \phi'_{\theta_0} \sqrt n(\theta - \theta_0)\bigr),
\end{aligned}
$$

for $Z_n = \sqrt n(\hat f_n - \phi(\theta_0))$. The sequence $Z_n$ is typically asymptotically normally
distributed, with mean zero. Minimization of this approximate criterion can be viewed as a
weighted least squares approach to regressing the "dependent vector" $Z_n$ on the "design
matrix" $\phi'_{\theta_0}$ using the parameter $\sqrt n(\theta - \theta_0)$. Standard linear regression theory shows
that the approximate criterion is minimized over $\sqrt n(\theta - \theta_0) \in \mathbb{R}^d$ by

$$
\sqrt n\,(\hat\theta_n - \theta_0) = \bigl((\phi'_{\theta_0})^T V_n \phi'_{\theta_0}\bigr)^{-1} (\phi'_{\theta_0})^T V_n Z_n.
$$

Furthermore, the best nonrandom weight matrix $V_n$, in terms of minimizing the covariance
of $\sqrt n(\hat\theta_n - \theta_0)$, is the inverse of the covariance matrix of $Z_n$. (Cf. the Gauss-Markov
theorem.) For our present situation this suggests choosing the matrix $V_n$ to be consistent
for the inverse of the asymptotic covariance matrix of the sequence $Z_n = \sqrt n(\hat f_n - \phi(\theta_0))$.
With this choice and the asymptotic covariance matrix denoted by $\Sigma_{\theta_0}$, we may expect
that

$$
\sqrt n\,(\hat\theta_n - \theta_0) \rightsquigarrow N\Bigl(0, \bigl((\phi'_{\theta_0})^T \Sigma_{\theta_0}^{-1} \phi'_{\theta_0}\bigr)^{-1}\Bigr).
$$
The argument shows that the generalized moment estimator can be viewed as a weighted
least squares estimator for regressing $\sqrt n(\hat f_n - \phi(\theta_0))$ onto $\phi'_{\theta_0}$. With the optimal
weighting matrix it is the best such estimator. If we use more initial moments to define $\hat f_n$ and
hence $\phi(\theta)$, then we add "observations" and corresponding rows to the design matrix $\phi'_{\theta_0}$,
but keep the same parameter $\sqrt n(\theta - \theta_0)$. This suggests that the asymptotic efficiency
of the optimally weighted generalized moment estimator increases if we use a longer
vector of initial moments $\hat f_n$. In particular, the optimally weighted generalized moment
estimator is more efficient than an ordinary moment estimator based on a subset of $d$ of
the initial moments. Thus, the generalized method of moments achieves the aim of using
more information contained in the observations.
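The effect of the weight matrix can be made concrete with a small numerical sketch. Everything in it is a hypothetical illustration: the Jacobian `jac` (playing the role of $\phi'_{\theta_0}$, with $e = 3$ moments and $d = 1$ parameter) and the covariance matrix `sigma` of $Z_n$ are made-up numbers. The function evaluates the covariance of the linear map $(J^T V J)^{-1} J^T V$ applied to $Z_n$, so that an arbitrary weight can be compared with the optimal one $V = \Sigma^{-1}$.

```python
import numpy as np

def gmm_asymptotic_cov(jac, sigma, weight):
    """Covariance of (J'VJ)^{-1} J'V Z when Z has covariance matrix sigma."""
    a = np.linalg.solve(jac.T @ weight @ jac, jac.T @ weight)
    return a @ sigma @ a.T

# Made-up inputs for illustration: three moments, one parameter.
jac = np.array([[1.0], [0.5], [0.25]])
sigma = np.array([[1.0, 0.3, 0.1],
                  [0.3, 2.0, 0.4],
                  [0.1, 0.4, 4.0]])

cov_identity = gmm_asymptotic_cov(jac, sigma, np.eye(3))
cov_optimal = gmm_asymptotic_cov(jac, sigma, np.linalg.inv(sigma))
# Ordinary moment estimator using only the first moment (d = 1 equation).
cov_first_moment_only = sigma[0, 0] / jac[0, 0] ** 2
```

In this example `cov_optimal` is no larger than either `cov_identity` or `cov_first_moment_only`, in agreement with the Gauss-Markov argument.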

These arguments are based on asymptotic approximations. They are reasonably
accurate for values of $n$ that are large relative to the values of $d$ and $e$, but should not be
applied if $d$ or $e$ are large. In particular, it is not legitimate to push the preceding argument to
its extreme and infer that it is necessarily right to use as many initial moments as possible.
Increasing the dimension of the vector $\hat f_n$ indefinitely may contribute more "variability"
to the criterion (and hence to the estimator) without increasing the information much,
depending on the accuracy of the estimator $V_n$.

The implementation of the (generalized) method of moments requires that the
expectations $\phi(\theta) = E_\theta f(X_t, \ldots, X_{t+h})$ are available as functions of $\theta$. In some models,
such as AR or MA models, this causes no difficulty, but already in ARMA models the
required analytical computations become complicated. Sometimes it is easy to simulate
realizations of a time series, but hard to compute explicit formulas for moments. In this
case the values $\phi(\theta)$ may be estimated stochastically for a grid of values of $\theta$ by
simulating realizations of the given time series, taking in turn each of the grid points as
the "true" parameter, and next computing the empirical moment for the simulated time 
series. If the grid is sufficiently dense and the simulations are sufficiently long, then the 
grid point for which the simulated empirical moment matches the empirical moment of 
the data is close to the moment estimator. This method is called the simulated method 
of moments. 

10.2.1 Moving Average Processes 

Suppose that $X_t - \mu = \sum_{j=0}^{q} \theta_j Z_{t-j}$ is a moving average process of order $q$. For simplicity
of notation assume that $1 = \theta_0$ and define $\theta_j = 0$ for $j < 0$ or $j > q$. Then the
auto-covariance function of $X_t$ can be written in the form

$$
\gamma_X(h) = \sigma^2 \sum_j \theta_j \theta_{j+h}.
$$

Given observations $X_1, \ldots, X_n$ we can estimate $\gamma_X(h)$ by the sample auto-covariance
function and next obtain estimators for $\sigma^2, \theta_1, \ldots, \theta_q$ by solving the system of
equations

$$
\hat\gamma_n(h) = \hat\sigma^2 \sum_j \hat\theta_j \hat\theta_{j+h}, \qquad h = 0, 1, \ldots, q.
$$
A solution of this system does not necessarily exist, or may be nonunique. It cannot 
be derived in closed form, but must be determined numerically by an iterative method. 




Thus applying the method of moments for moving average processes is considerably 
more involved than for auto-regressive processes. The real drawback of this method is, 
however, that the moment estimators are less efficient than the least squares estimators 
that we discuss later in this chapter. Moment estimators are therefore at best only used 
as starting points for numerical procedures to compute other estimators. 

10.11 Example (MA(1)). For the moving average process $X_t = Z_t + \theta Z_{t-1}$ the moment
equations are

$$
\gamma_X(0) = \sigma^2 (1 + \theta^2), \qquad \gamma_X(1) = \theta \sigma^2.
$$

Replacing $\gamma_X$ by $\hat\gamma_n$ and solving for $\sigma^2$ and $\theta$ yields the moment estimators

$$
\hat\theta_n = \frac{1 \pm \sqrt{1 - 4 \hat\rho_n^2(1)}}{2 \hat\rho_n(1)}, \qquad \hat\sigma^2 = \frac{\hat\gamma_n(1)}{\hat\theta_n}.
$$

We obtain a real solution for $\hat\theta_n$ only if $|\hat\rho_n(1)| \le 1/2$. Because the true auto-correlation
$\rho_X(1)$ is contained in the interval $[-1/2, 1/2]$, it is reasonable to truncate the sample
auto-correlation $\hat\rho_n(1)$ to this interval and then we always have some solution. If $|\hat\rho_n(1)| < 1/2$,
then there are two solutions for $\hat\theta_n$, corresponding to the $\pm$ sign. This situation will
happen with probability tending to one if the true auto-correlation $\rho_X(1)$ is strictly
contained in the interval $(-1/2, 1/2)$. From the two solutions, one solution has $|\hat\theta_n| < 1$
and corresponds to an invertible moving average process; the other solution has $|\hat\theta_n| > 1$.
The existence of multiple solutions was to be expected in view of Theorem 7.27.
Assume that the true value $|\theta| < 1$, so that $\rho_X(1) \in (-1/2, 1/2)$ and

$$
\theta = \frac{1 - \sqrt{1 - 4 \rho_X^2(1)}}{2 \rho_X(1)}.
$$

Of course, we use the estimator $\hat\theta_n$ defined by the minus sign. Then $\hat\theta_n - \theta$ can be written
as $\phi(\hat\rho_n(1)) - \phi(\rho_X(1))$ for the function $\phi$ given by

$$
\phi(\rho) = \frac{1 - \sqrt{1 - 4 \rho^2}}{2 \rho}.
$$

This function is differentiable on the interval $(-1/2, 1/2)$. By the Delta-method the
limit distribution of the sequence $\sqrt n(\hat\theta_n - \theta)$ is the same as the limit distribution of
the sequence $\phi'(\rho_X(1)) \sqrt n(\hat\rho_n(1) - \rho_X(1))$. Using Theorem 5.8 we obtain, after a long
calculation, that

$$
\sqrt n\,(\hat\theta_n - \theta) \rightsquigarrow N\Bigl(0, \frac{1 + \theta^2 + 4\theta^4 + \theta^6 + \theta^8}{(1 - \theta^2)^2}\Bigr).
$$

Thus, to a certain extent, the method of moments works: the moment estimator $\hat\theta_n$
converges at a rate of $1/\sqrt n$ to the true parameter. However, the asymptotic variance
is large, in particular for $\theta \approx 1$. We shall see later that there exist estimators with
asymptotic variance $1 - \theta^2$, which is smaller for every $\theta$, and is particularly small for
$\theta \approx 1$. □
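The MA(1) moment estimator of this example is simple enough to code directly. The sketch below uses only the standard library; the function names are illustrative. It truncates the auto-correlation to $[-1/2, 1/2]$, takes the invertible root (minus sign), and computes the variance estimate as $\hat\gamma_n(0)/(1 + \hat\theta_n^2)$, which agrees with $\hat\gamma_n(1)/\hat\theta_n$ when no truncation occurs and avoids division by zero when $\hat\theta_n = 0$. The second function evaluates the asymptotic variance $(1 + \theta^2 + 4\theta^4 + \theta^6 + \theta^8)/(1 - \theta^2)^2$ that results from combining Bartlett's formula for $\hat\rho_n(1)$ with the Delta-method.

```python
import math

def ma1_moment_estimator(gamma0, gamma1):
    """Moment estimators (theta, sigma^2) for X_t = Z_t + theta * Z_{t-1}."""
    rho = max(-0.5, min(0.5, gamma1 / gamma0))    # truncate to [-1/2, 1/2]
    if rho == 0.0:
        theta = 0.0
    else:
        # Root with the minus sign: the invertible solution, |theta| <= 1.
        theta = (1.0 - math.sqrt(1.0 - 4.0 * rho * rho)) / (2.0 * rho)
    sigma2 = gamma0 / (1.0 + theta * theta)
    return theta, sigma2

def moment_asymptotic_variance(theta):
    """Asymptotic variance of sqrt(n)(theta_hat - theta) for |theta| < 1."""
    t2 = theta * theta
    return (1 + t2 + 4 * t2 ** 2 + t2 ** 3 + t2 ** 4) / (1 - t2) ** 2
```

Feeding in the exact auto-covariances of an MA(1) process with $\theta = 0.5$ and $\sigma^2 = 2$, namely $\gamma_X(0) = 2.5$ and $\gamma_X(1) = 1$, returns these parameter values; the asymptotic variance at $\theta = 0.5$ is about 2.7, compared with $1 - \theta^2 = 0.75$ for the more efficient estimators mentioned above.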
10.12 EXERCISE. Derive the formula for the asymptotic variance, or at least convince
yourself that you know how to get it.

The asymptotic behaviour of the moment estimators for moving averages of order 
higher than 1 can be analysed, as in the preceding example, by the Delta-method as well. 

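Define $\phi\colon \mathbb{R}^{q+1} \to \mathbb{R}^{q+1}$ by

$$
\phi \begin{pmatrix} \sigma^2 \\ \theta_1 \\ \vdots \\ \theta_q \end{pmatrix}
= \sigma^2 \begin{pmatrix} \sum_j \theta_j^2 \\ \sum_j \theta_j \theta_{j+1} \\ \vdots \\ \sum_j \theta_j \theta_{j+q} \end{pmatrix}.
$$

Then the moment estimators and true parameters satisfy

$$
\begin{pmatrix} \hat\sigma^2 \\ \hat\theta_1 \\ \vdots \\ \hat\theta_q \end{pmatrix}
= \phi^{-1} \begin{pmatrix} \hat\gamma_n(0) \\ \hat\gamma_n(1) \\ \vdots \\ \hat\gamma_n(q) \end{pmatrix},
\qquad
\begin{pmatrix} \sigma^2 \\ \theta_1 \\ \vdots \\ \theta_q \end{pmatrix}
= \phi^{-1} \begin{pmatrix} \gamma_X(0) \\ \gamma_X(1) \\ \vdots \\ \gamma_X(q) \end{pmatrix}.
$$

The joint limit distribution of the sequences $\sqrt n(\hat\gamma_n(h) - \gamma_X(h))$ is known from
Theorem 5.7. Therefore, the limit distribution of the moment estimators $\hat\sigma^2, \hat\theta_1, \ldots, \hat\theta_q$ follows
by the Delta-method, provided the map $\phi^{-1}$ is differentiable at $(\gamma_X(0), \ldots, \gamma_X(q))$.
Actually, the situation is more complicated, because the moment equations may have zero
or multiple solutions, as illustrated in the preceding example. This difficulty disappears
if we insist on an invertible representation of the moving average process, i.e. require that
the polynomial $1 + \theta_1 z + \cdots + \theta_q z^q$ has no roots in the complex unit disc. This follows
by the following lemma, whose proof also contains an algorithm to compute the moment
estimators numerically.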

10.13 Lemma. Let $\Theta \subset \mathbb{R}^q$ be the set of all vectors $(\theta_1, \ldots, \theta_q)$ such that all roots of
$1 + \theta_1 z + \cdots + \theta_q z^q$ are outside the unit circle. Then the map $\phi\colon \mathbb{R}^+ \times \Theta \to \mathbb{R}^{q+1}$ is one-to-one and
continuously differentiable. Furthermore, the map $\phi^{-1}$ is differentiable at every point
$\phi(\sigma^2, \theta_1, \ldots, \theta_q)$ for which the roots of $1 + \theta_1 z + \cdots + \theta_q z^q$ are distinct.

* Proof. Abbreviate $\gamma_h = \gamma_X(h)$. The system of equations $\sigma^2\sum_j\theta_j\theta_{j+h} = \gamma_h$ for $h = 0, \ldots, q$ implies that

$$\sum_{h=-q}^{q}\gamma_h z^h = \sigma^2\sum_h\sum_j\theta_j\theta_{j+h}z^h = \sigma^2\theta(z^{-1})\theta(z).$$

For any $h > 0$ the function $z^h + z^{-h}$ can be expressed as a polynomial of degree $h$ in $w = z + z^{-1}$. For instance, $z^2 + z^{-2} = w^2 - 2$ and $z^3 + z^{-3} = w^3 - 3w$. The case of general $h$ can be treated by induction, upon noting that by rearranging Newton's binomial formula

$$z^{h+1} + z^{-h-1} = w^{h+1} - \binom{h+1}{(h+1)/2} - \sum_{0<j<h+1}\binom{h+1}{(h+1-j)/2}\,(z^j + z^{-j}).$$

Here a binomial coefficient with a non-integer lower entry is to be read as zero. Thus the left side of the first display can be written in the form

$$\gamma_0 + \sum_{h=1}^{q}\gamma_h(z^h + z^{-h}) = a_0 + a_1w + \cdots + a_qw^q,$$

for certain coefficients $(a_0, \ldots, a_q)$. Let $w_1, \ldots, w_q$ be the zeros of the polynomial on the right, and for each $j$ let $\eta_j$ and $\eta_j^{-1}$ be the solutions of the quadratic equation $z + z^{-1} = w_j$. Choose $|\eta_j| > 1$. Then we can rewrite the right side of the preceding display as

$$a_q\prod_{j=1}^{q}(z + z^{-1} - w_j) = a_q\prod_{j=1}^{q}(z - \eta_j)(\eta_j - z^{-1})\eta_j^{-1}.$$

On comparing this to the first display of the proof, we see that $\eta_1, \ldots, \eta_q$ are the zeros of the polynomial $\theta(z)$. This allows us to construct a map

$$(\gamma_0, \ldots, \gamma_q) \mapsto (a_0, \ldots, a_q) \mapsto (w_1, \ldots, w_q, a_q) \mapsto (\eta_1, \ldots, \eta_q, a_q) \mapsto (\theta_1, \ldots, \theta_q, \sigma^2).$$

If restricted to the range of $\phi$ this is exactly the map $\phi^{-1}$. It is not hard to see that the first and last step in this decomposition of $\phi^{-1}$ are analytic functions. The two middle steps concern mapping coefficients of polynomials into their zeros.

For $\alpha = (\alpha_0, \ldots, \alpha_q) \in \mathbb{C}^{q+1}$ let $p_\alpha(w) = \alpha_0 + \alpha_1w + \cdots + \alpha_qw^q$. By the implicit function theorem for functions of several complex variables we can show the following. If for some $\alpha$ the polynomial $p_\alpha$ has a root of order 1 at a point $w_\alpha$, then there exist neighbourhoods $U_\alpha$ and $V_\alpha$ of $\alpha$ and $w_\alpha$ such that for every $\beta \in U_\alpha$ the polynomial $p_\beta$ has exactly one zero $w_\beta \in V_\alpha$ and the map $\beta \mapsto w_\beta$ is analytic on $U_\alpha$. Thus, under the assumption that all roots are of multiplicity one, the roots can be viewed as analytic functions of the coefficients. If $\theta$ has distinct roots, then $\eta_1, \ldots, \eta_q$ are of multiplicity one and hence so are $w_1, \ldots, w_q$. In that case the map is analytic. ■
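For $q = 1$ the construction in the proof can be carried out by hand, which may help to see the algorithm at work. The following is a minimal sketch and not part of the original notes: the function name `ma1_moment_estimator` is hypothetical, and its inputs are assumed to be (estimated) auto-covariances at lags 0 and 1 with $|\gamma_1/\gamma_0| < 1/2$, so that an invertible solution exists.

```python
import cmath


def ma1_moment_estimator(gamma0, gamma1):
    """Invertible solution (sigma2, theta) of the MA(1) moment equations
    gamma0 = sigma2 * (1 + theta**2) and gamma1 = sigma2 * theta."""
    if gamma1 == 0:
        return gamma0, 0.0  # white noise: theta = 0
    # Step 1: gamma0 + gamma1 * (z + 1/z) = a0 + a1 * w, with a0 = gamma0, a1 = gamma1.
    w1 = -gamma0 / gamma1  # the single zero of a0 + a1 * w
    # Step 2: solve z + 1/z = w1, i.e. z**2 - w1 * z + 1 = 0.
    disc = cmath.sqrt(w1 * w1 - 4)
    roots = [(w1 + disc) / 2, (w1 - disc) / 2]
    eta = max(roots, key=abs)  # the root with |eta| > 1, if there is one
    if abs(abs(eta) - 1) < 1e-12 or abs(eta.imag) > 1e-12:
        raise ValueError("no invertible real solution; need |gamma1/gamma0| < 1/2")
    # Step 3: theta(z) = 1 + theta * z has its zero at eta.
    theta = (-1 / eta).real
    sigma2 = gamma1 / theta
    return sigma2, theta
```

In practice one would call the function with the sample auto-covariances $\hat\gamma_n(0)$ and $\hat\gamma_n(1)$. For general $q$ the same steps require a numerical polynomial root finder for the two middle maps.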



* 10.2.2 Moment Estimators for ARMA Processes 

If $X_t - \mu$ is a stationary ARMA process satisfying $\phi(B)(X_t - \mu) = \theta(B)Z_t$, then

$$\operatorname{cov}\bigl(\phi(B)(X_t - \mu), X_{t-k}\bigr) = \mathrm{E}\bigl(\theta(B)Z_t\bigr)X_{t-k}.$$

If $X_t - \mu$ is a causal, stationary ARMA process, then the right side vanishes for $k > q$. Working out the left side, we obtain the equations

$$\gamma_X(k) - \phi_1\gamma_X(k-1) - \cdots - \phi_p\gamma_X(k-p) = 0, \qquad k > q.$$

For $k = q+1, \ldots, q+p$ this leads to the system

$$\begin{pmatrix}\gamma_X(q) & \gamma_X(q-1) & \cdots & \gamma_X(q-p+1)\\ \gamma_X(q+1) & \gamma_X(q) & \cdots & \gamma_X(q-p+2)\\ \vdots & \vdots & \ddots & \vdots\\ \gamma_X(q+p-1) & \gamma_X(q+p-2) & \cdots & \gamma_X(q)\end{pmatrix}\begin{pmatrix}\phi_1\\ \phi_2\\ \vdots\\ \phi_p\end{pmatrix} = \begin{pmatrix}\gamma_X(q+1)\\ \gamma_X(q+2)\\ \vdots\\ \gamma_X(q+p)\end{pmatrix}.$$

These are the Yule-Walker equations for general stationary ARMA processes and may be used to obtain estimators $\hat\phi_1, \ldots, \hat\phi_p$ of the auto-regressive parameters in the same way as for auto-regressive processes: we replace $\gamma_X$ by $\hat\gamma_n$ and solve for $\hat\phi_1, \ldots, \hat\phi_p$.

Next we apply the method of moments for moving averages to the time series $Y_t = \theta(B)Z_t$ to obtain estimators for the parameters $\sigma^2, \theta_1, \ldots, \theta_q$. Because also $Y_t = \phi(B)(X_t - \mu)$ we can estimate the covariance function $\gamma_Y$ from

$$\gamma_Y(h) = \sum_i\sum_j\tilde\phi_i\tilde\phi_j\gamma_X(h + i - j), \qquad\text{if } \phi(z) = \sum_j\tilde\phi_jz^j.$$

Let $\hat\gamma_Y(h)$ be the estimators obtained by replacing the unknown parameters $\tilde\phi_j$ and $\gamma_X(h)$ by their moment estimators and sample moments, respectively. Next we solve $\hat\sigma^2, \hat\theta_1, \ldots, \hat\theta_q$ from the system of equations

$$\hat\gamma_Y(h) = \hat\sigma^2\sum_j\hat\theta_j\hat\theta_{j+h}, \qquad h = 0, 1, \ldots, q.$$

As is explained in the preceding section, if $X_t - \mu$ is invertible, then the solution is unique, with probability tending to one, if the coefficients $\theta_1, \ldots, \theta_q$ are restricted to give an invertible stationary ARMA process.

The resulting estimators $(\hat\sigma^2, \hat\theta_1, \ldots, \hat\theta_q, \hat\phi_1, \ldots, \hat\phi_p)$ can be written as a function of $(\hat\gamma_n(0), \ldots, \hat\gamma_n(q+p))$. The true values of the parameters can be written as the same function of the vector $(\gamma_X(0), \ldots, \gamma_X(q+p))$. In principle, under some conditions, the limit distribution of the estimators can be obtained by the Delta-method.
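The first two steps of this procedure can be sketched in a few lines of code. This is an illustration under stated assumptions, not part of the original notes: the names `solve_linear`, `extended_yule_walker` and `gamma_y` are hypothetical, and `gamma` is assumed to be a list holding the (sample) auto-covariances at lags $0, \ldots, q+p$.

```python
def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def extended_yule_walker(gamma, p, q):
    """Solve the Yule-Walker system for k = q+1, ..., q+p for (phi_1, ..., phi_p)."""
    g = lambda h: gamma[abs(h)]  # auto-covariances are symmetric in the lag
    a = [[g(q + i - j) for j in range(p)] for i in range(p)]
    b = [g(q + 1 + i) for i in range(p)]
    return solve_linear(a, b)


def gamma_y(gamma, phi, h):
    """Auto-covariance at lag h of Y_t = phi(B)(X_t - mu)."""
    g = lambda k: gamma[abs(k)]
    coef = [1.0] + [-f for f in phi]  # coefficients of phi(z) = 1 - phi_1 z - ...
    return sum(ci * cj * g(h + i - j)
               for i, ci in enumerate(coef) for j, cj in enumerate(coef))
```

The values `gamma_y(gamma, phi, h)` for $h = 0, \ldots, q$ would then be passed to a solver for the moving average moment equations of the preceding section.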



10.2.3 Stochastic Volatility Models 

In the stochastic volatility model discussed in Section 9.3 an observation $X_t$ is defined as $X_t = \sigma_tZ_t$ for $\log\sigma_t$ a stationary auto-regressive process satisfying $\log\sigma_t = \alpha + \phi\log\sigma_{t-1} + \sigma V_{t-1}$, and $(V_t, Z_t)$ an i.i.d. sequence of bivariate normal vectors with mean zero, unit variances and correlation $\delta$. Thus the model is parameterized by four parameters $\alpha, \phi, \sigma, \delta$.

The series $X_t$ is a white noise series and hence we cannot use the auto-covariances $\gamma_X(h)$ at lags $h \ne 0$ to construct moment estimators. Instead, we might use higher marginal moments or auto-covariances of powers of the series. In particular, it is computed in Section 9.3 that

$$\rho_{X^2}(1) = \frac{(1 + 4\delta^2\sigma^2)e^{4\sigma^2\phi/(1-\phi^2)} - 1}{3e^{4\sigma^2/(1-\phi^2)} - 1},$$

$$\rho_{X^2}(2) = \frac{(1 + 4\delta^2\sigma^2\phi^2)e^{4\sigma^2\phi^2/(1-\phi^2)} - 1}{3e^{4\sigma^2/(1-\phi^2)} - 1},$$

$$\rho_{X^2}(3) = \frac{(1 + 4\delta^2\sigma^2\phi^4)e^{4\sigma^2\phi^3/(1-\phi^2)} - 1}{3e^{4\sigma^2/(1-\phi^2)} - 1}.$$

We can use a selection of these moments to define moment estimators, or use some or all of them to define generalized moments estimators. Because the functions on the right side are complicated, this requires some effort, but it is feasible.†

† See Taylor (1986).



10.3 Least Squares Estimators

For auto-regressive processes the method of least squares is directly suggested by the structural equation defining the model, but can also be derived from the prediction problem. The second point of view is deeper and can be applied to general time series.

A least squares estimator is based on comparing the predicted value of an observation $X_t$ based on the preceding observations to the actually observed value $X_t$. Such a prediction $\Pi_{t-1}X_t$ will generally depend on the underlying parameter $\theta$ of the model, something we make visible in the notation by writing it as $\Pi_{t-1}X_t(\theta)$. The index $t-1$ of $\Pi_{t-1}$ indicates that $\Pi_{t-1}X_t(\theta)$ is a function of $X_1, \ldots, X_{t-1}$ (and the parameter) only. By convention we define $\Pi_0X_1 = 0$. A weighted least squares estimator, with inverse weights $w_t(\theta)$, is defined as the minimizer, if it exists, of the function

$$\theta \mapsto \sum_{t=1}^{n}\frac{\bigl(X_t - \Pi_{t-1}X_t(\theta)\bigr)^2}{w_t(\theta)}. \tag{10.2}$$

This expression depends only on the observations $X_1, \ldots, X_n$ and the unknown parameter $\theta$ and hence is an "observable criterion function". The idea is that using the "true" parameter should yield the "best" predictions. The weights $w_t(\theta)$ could be chosen equal to one, but are generally chosen to increase the efficiency of the resulting estimator.

This least squares principle is intuitively reasonable for any sense of prediction, in particular both for linear and nonlinear prediction. For nonlinear prediction we set $\Pi_{t-1}X_t(\theta)$ equal to the conditional expectation $\mathrm{E}_\theta(X_t \mid X_1, \ldots, X_{t-1})$, an expression that may or may not be easy to derive analytically.

For linear prediction, if we assume that the time series $X_t$ is centered at mean zero, we set $\Pi_{t-1}X_t(\theta)$ equal to the linear combination $\beta_1X_{t-1} + \cdots + \beta_{t-1}X_1$ that minimizes

$$(\beta_1, \ldots, \beta_{t-1}) \mapsto \mathrm{E}_\theta\bigl(X_t - (\beta_1X_{t-1} + \cdots + \beta_{t-1}X_1)\bigr)^2.$$

In Chapter 2 the coefficients of the best linear predictor are expressed in the auto-covariance function $\gamma_X$ by the prediction equations (2.1). Thus the coefficients $\beta_t$ depend on the parameter $\theta$ of the underlying model through the auto-covariance function. Hence the least squares estimators using linear predictors can also be viewed as moment estimators.

The difference $X_t - \Pi_{t-1}X_t(\theta)$ between the true value and its prediction is called the innovation. Its second moment

$$v_{t-1}(\theta) = \mathrm{E}_\theta\bigl(X_t - \Pi_{t-1}X_t(\theta)\bigr)^2$$

is called the (square) prediction error at time $t-1$. The weights $w_t(\theta)$ are often chosen equal to the prediction errors $v_{t-1}(\theta)$ in order to ensure that the terms of the sum of squares contribute "equal" amounts of information.

For both linear and nonlinear predictors the innovations $X_1 - \Pi_0X_1(\theta), X_2 - \Pi_1X_2(\theta), \ldots, X_n - \Pi_{n-1}X_n(\theta)$ are uncorrelated random variables. This orthogonality suggests that the terms of the sum contribute "additive information" to the criterion, which should be good. It also shows that there is usually no need to replace the sum of squares by a more general quadratic form, which would be the standard approach in ordinary least squares estimation.

Whether the sum of squares indeed possesses a (unique) point of minimum $\hat\theta$ and whether this constitutes a good estimator of the parameter $\theta$ depends on the statistical model for the time series. Moreover, this model determines the feasibility of computing the point of minimum given the data. Auto-regressive and GARCH processes provide a positive and a negative example.

10.14 Example (AR). A mean-zero causal, stationary, auto-regressive process of order $p$ is modelled through the parameter $\theta = (\sigma^2, \phi_1, \ldots, \phi_p)$. For $t > p$ the best linear predictor is given by $\Pi_{t-1}X_t = \phi_1X_{t-1} + \cdots + \phi_pX_{t-p}$ and the prediction error is $v_{t-1} = \mathrm{E}Z_t^2 = \sigma^2$. For $t \le p$ the formulas are more complicated, but could be obtained in principle.

The weighted sum of squares with weights $w_t = v_{t-1}$ reduces to

$$\sum_{t=1}^{p}\frac{\bigl(X_t - \Pi_{t-1}X_t(\phi_1, \ldots, \phi_p)\bigr)^2}{v_{t-1}(\sigma^2, \phi_1, \ldots, \phi_p)} + \sum_{t=p+1}^{n}\frac{(X_t - \phi_1X_{t-1} - \cdots - \phi_pX_{t-p})^2}{\sigma^2}.$$

Because the first term, consisting of $p$ of the $n$ terms of the sum of squares, possesses a complicated form, it is often dropped from the sum of squares. Then we obtain exactly the sum of squares considered in Section 10.1, but with $\overline X_n$ replaced by 0 and divided by $\sigma^2$. For large $n$ the difference between the sums of squares and hence between the two types of least squares estimators should be negligible.

Another popular strategy to simplify the sum of squares is to act as if the "observations" $X_0, X_{-1}, \ldots, X_{-p+1}$ are available and to redefine $\Pi_{t-1}X_t$ for $t = 1, \ldots, p$ accordingly. This is equivalent to dropping the first term and letting the sum in the second term start at $t = 1$ rather than at $t = p+1$. To implement the estimator we must now choose numerical values for the missing observations $X_0, X_{-1}, \ldots, X_{-p+1}$; zero is a common choice.

The least squares estimators for $\phi_1, \ldots, \phi_p$, being (almost) identical to the Yule-Walker estimators, are $\sqrt n$-consistent and asymptotically normal. However, the least squares criterion does not lead to a useful estimator for $\sigma^2$: minimization over $\sigma^2$ leads to $\hat\sigma^2 = \infty$ and this is obviously not a good estimator. A more honest conclusion is that the least squares criterion as posed originally fails for auto-regressive processes, since minimization over the full parameter $\theta = (\sigma^2, \phi_1, \ldots, \phi_p)$ leads to a zero sum of squares for $\sigma^2 = \infty$ and arbitrary (finite) values of the remaining parameters. The method of least squares works only for the subparameter $(\phi_1, \ldots, \phi_p)$ if we first drop $\sigma^2$ from the sum of squares. □
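For $p = 1$ the simplified criterion, with the first term dropped and $\sigma^2$ omitted, has a closed-form minimizer. The sketch below is an illustration only and not part of the original notes: the function names are hypothetical, and the simulation starts the recursion at zero rather than in the stationary distribution, which is an assumption that matters little for long series.

```python
import random


def ar1_least_squares(x):
    """Minimise sum_{t=2}^n (x_t - phi * x_{t-1})**2 over phi."""
    num = sum(x[t] * x[t - 1] for t in range(1, len(x)))
    den = sum(x[t - 1] ** 2 for t in range(1, len(x)))
    return num / den


def simulate_ar1(phi, n, seed=0):
    """Simulate X_t = phi * X_{t-1} + Z_t with standard normal Z_t, started at 0."""
    rng = random.Random(seed)
    x, prev = [], 0.0
    for _ in range(n):
        prev = phi * prev + rng.gauss(0.0, 1.0)
        x.append(prev)
    return x
```

For a long simulated series, `ar1_least_squares(simulate_ar1(0.6, 5000))` should be close to 0.6, in line with the $\sqrt n$-consistency stated above.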

10.15 Example (GARCH). A GARCH process is a martingale difference series and hence the one-step predictions $\Pi_{t-1}X_t(\theta)$ are identically zero. Consequently, the weighted least squares sum, with weights equal to the prediction errors, reduces to

$$\sum_{t=1}^{n}\frac{X_t^2}{v_{t-1}(\theta)}.$$

Minimizing this criterion over $\theta$ is equivalent to maximizing the prediction errors $v_{t-1}(\theta)$. It is intuitively clear that this does not lead to reasonable estimators.

One alternative is to apply the least squares method to the squared series $X_t^2$. This satisfies an ARMA equation in view of (8.3). (Note however that the innovations in that equation are also dependent on the parameter.) The best fix of the least squares method is to augment the least squares criterion to the Gaussian likelihood, as discussed in Chapter 12. □

So far the discussion in this section has assumed implicitly that the mean value $\mu = \mathrm{E}X_t$ of the time series is zero. If this is not the case, then we apply the preceding discussion to the time series $X_t - \mu$ instead of to $X_t$, assuming first that $\mu$ is known. Then the parameter $\mu$ will show up in the least squares criterion. To define estimators we can either replace the unknown value $\mu$ by the sample mean $\overline X_n$ and minimize the sum of squares with respect to the remaining parameters, or perform a joint minimization over all parameters.

Least squares estimators can rarely be written in closed form, the case of stationary auto-regressive processes being an exception, but iterative algorithms for the approximate calculation, for instance Newton-type algorithms, are implemented in many computer packages. The best linear predictions $\Pi_{t-1}X_t$ are often computed recursively in $t$ (for a grid of values $\theta$), for instance with the help of a state space representation of the time series and the Kalman filter. We do not discuss this numerical aspect, but remark that even with modern day computing power, the use of a carefully designed algorithm is advisable.

The method of least squares is closely related to Gaussian likelihood, as discussed in 
Chapter 12. Gaussian likelihood is perhaps more fundamental than the method of least 
squares. For this reason we restrict further discussion of the method of least squares to 
ARMA processes. 

10.3.1 ARMA Processes 

The method of least squares works very well for estimating the regression and moving average parameters $(\phi_1, \ldots, \phi_p, \theta_1, \ldots, \theta_q)$ of ARMA processes, if we perform the minimization for a fixed value of the parameter $\sigma^2$. In general, if some parameter, such as $\sigma^2$ for ARMA processes, enters the covariance function as a multiplicative factor, then the best linear predictor $\Pi_tX_{t+1}$ is free from this parameter, by the prediction equations (2.1). On the other hand, the prediction error $v_t = \gamma_X(0) - (\beta_1, \ldots, \beta_t)\Gamma_t(\beta_1, \ldots, \beta_t)^T$ (where $\beta_1, \ldots, \beta_t$ are the coefficients of the best linear predictor) contains the parameter as a multiplicative factor. It follows that the inverse of the parameter will enter the least squares criterion as a multiplicative factor. Thus on the one hand the least squares method does not yield an estimator for this parameter; on the other hand, we can just omit the parameter and minimize the criterion over the remaining parameters. In particular, in the case of ARMA processes the least squares estimators for $(\phi_1, \ldots, \phi_p, \theta_1, \ldots, \theta_q)$ are defined as the minimizers of, for $\tilde v_t = \sigma^{-2}v_t$,

$$\sum_{t=1}^{n}\frac{\bigl(X_t - \Pi_{t-1}X_t(\phi_1, \ldots, \phi_p, \theta_1, \ldots, \theta_q)\bigr)^2}{\tilde v_{t-1}(\phi_1, \ldots, \phi_p, \theta_1, \ldots, \theta_q)}.$$

This is a complicated function of the parameters. However, for a fixed value of $(\phi_1, \ldots, \phi_p, \theta_1, \ldots, \theta_q)$ it can be computed using the state space representation of an ARMA process and the Kalman filter.
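To make the criterion concrete, here is a sketch for the MA(1) model $X_t = Z_t + \theta Z_{t-1}$. It does not use the Kalman filter mentioned above; instead it swaps in the innovations recursion for the best linear predictor, which for an MA(1) process takes a particularly simple form. The function names and the grid search are illustrative assumptions, not part of the original notes.

```python
import random


def ma1_ls_criterion(x, theta):
    """Sum over t of (x_t - prediction_t)**2 / vtilde_{t-1} for the MA(1) model,
    where vtilde denotes the prediction error divided by sigma^2."""
    g0, g1 = 1.0 + theta * theta, theta  # gamma(0)/sigma^2 and gamma(1)/sigma^2
    v = g0       # scaled prediction error of the convention Pi_0 X_1 = 0
    pred = 0.0
    total = 0.0
    for xt in x:
        innov = xt - pred
        total += innov * innov / v
        coef = g1 / v             # weight of the last innovation in the next prediction
        pred = coef * innov
        v = g0 - coef * coef * v  # next scaled prediction error
    return total


def ma1_least_squares(x, grid):
    """Grid value of theta that minimises the least squares criterion."""
    return min(grid, key=lambda th: ma1_ls_criterion(x, th))


def simulate_ma1(theta, n, seed=0):
    """Simulate X_t = Z_t + theta * Z_{t-1} with standard normal Z_t."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(n + 1)]
    return [z[t] + theta * z[t - 1] for t in range(1, n + 1)]
```

A crude grid such as `[i / 100 for i in range(-90, 91)]` keeps the search inside the invertible region $|\theta| < 1$; a Newton-type algorithm would be the more usual choice in practice.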

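10.16 Theorem. Let $X_t$ be a causal and invertible stationary ARMA($p$, $q$) process relative to an i.i.d. sequence $Z_t$ with finite fourth moments. Then the least squares estimators satisfy

$$\sqrt n\left(\begin{pmatrix}\hat{\vec\phi}_p\\ \hat{\vec\theta}_q\end{pmatrix} - \begin{pmatrix}\vec\phi_p\\ \vec\theta_q\end{pmatrix}\right) \rightsquigarrow N\bigl(0, \sigma^2J^{-1}_{\vec\phi_p, \vec\theta_q}\bigr),$$

where $\vec\phi_p = (\phi_1, \ldots, \phi_p)$, $\vec\theta_q = (\theta_1, \ldots, \theta_q)$ and $J_{\vec\phi_p, \vec\theta_q}$ is the covariance matrix of $(U_{-1}, \ldots, U_{-p}, V_{-1}, \ldots, V_{-q})$ for stationary auto-regressive processes $U_t$ and $V_t$ satisfying $\phi(B)U_t = \theta(B)V_t = Z_t$.

Proof. The proof of this theorem is long and technical. See e.g. Brockwell and Davis (1991), pages 375-396, Theorem 10.8.2. ■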
10.17 Example (MA(1)). The least squares estimator $\hat\theta_n$ for $\theta$ in the moving average process $X_t = Z_t + \theta Z_{t-1}$ with $|\theta| < 1$ possesses asymptotic variance equal to $\sigma^2/\operatorname{var}V_{-1}$, where $V_t$ is the stationary solution to the equation $\theta(B)V_t = Z_t$. Note that $V_t$ is an auto-regressive process of order 1, not a moving average!

As we have seen before the process $V_t$ possesses the representation $V_t = \sum_{j=0}^{\infty}(-\theta)^jZ_{t-j}$ and hence $\operatorname{var}V_t = \sigma^2/(1 - \theta^2)$ for every $t$.

Thus the sequence $\sqrt n(\hat\theta_n - \theta)$ is asymptotically normally distributed with mean zero and variance equal to $1 - \theta^2$. This should be compared to the asymptotic distribution of the moment estimator, obtained in Example 10.11. □

10.18 EXERCISE. Find the asymptotic covariance matrix of the sequence $\sqrt n(\hat\phi_n - \phi, \hat\theta_n - \theta)$ for $(\hat\phi_n, \hat\theta_n)$ the least squares estimators for the stationary, causal, invertible ARMA process satisfying $X_t = \phi X_{t-1} + Z_t + \theta Z_{t-1}$.



11 Spectral Estimation



In this chapter we study nonparametric estimators of the spectral density and spectral 
distribution of a stationary time series. As in Chapter 5 "nonparametric" means that no 
a-priori structure of the series is assumed, apart from stationarity. 

If a well-fitting model is available, then an alternative to the methods of this chapter is to use spectral estimators suited to this model. For instance, the spectral density of a stationary ARMA process can be expressed in the parameters $\sigma^2, \phi_1, \ldots, \phi_p, \theta_1, \ldots, \theta_q$ of the model. It is natural to use the formula given in Section 7.5 for estimating the spectrum, by simply plugging in estimators for the parameters. If the ARMA model is appropriate, this should lead to better estimators than the nonparametric estimators discussed in this chapter. We do not further discuss this type of estimator.

Let the observations $X_1, \ldots, X_n$ be the values at times $1, \ldots, n$ of a stationary time series $X_t$, and let $\hat\gamma_n$ be their sample auto-covariance function. In view of the definition of the spectral density $f_X(\lambda)$, a natural estimator is

$$\hat f_{n,r}(\lambda) = \frac{1}{2\pi}\sum_{|h|<r}\hat\gamma_n(h)e^{-ih\lambda}. \tag{11.1}$$

Whereas $f_X(\lambda)$ is defined as an infinite series, the estimator $\hat f_{n,r}$ is truncated at its $r$th term. Because the estimators $\hat\gamma_n(h)$ are defined only for $|h| < n$ and there is no hope of estimating the auto-covariances $\gamma_X(h)$ for lags $|h| \ge n$, we must choose $r \le n$. Because the estimators $\hat\gamma_n(h)$ are unreliable for $|h| \approx n$, it may be wise to choose $r$ much smaller than $n$. We shall see that a good choice of $r$ depends on the smoothness of the spectral density and also on which aspect of the spectrum is of interest. For estimating $f_X(\lambda)$ at a point, values of $r$ such as $n^\alpha$ for some $\alpha \in (0, 1)$ may be appropriate, whereas for estimating the spectral distribution function (i.e. areas under $f_X$) the choice $r = n$ works well.

In any case, since the covariances of lags $|h| \ge n$ can never be estimated from the data, nonparametric estimation of the spectrum is hopeless, unless one is willing to assume that expressions such as $\sum_{|h|\ge n}|\gamma_X(h)|$ are small. In Section 11.3 ahead we relate this tail series to the smoothness of the function $\lambda \mapsto f_X(\lambda)$.
11.1 Finite Fourier Transform

The finite Fourier transform is a useful tool in spectral analysis, both for theory and practice. The practical use comes from the fact that it can be computed very efficiently by a clever algorithm, the Fast Fourier Transform (FFT).†

The finite Fourier transform of an arbitrary sequence $x_1, \ldots, x_n$ of complex numbers is defined as the function $\lambda \mapsto d_x(\lambda)$ given by

$$d_x(\lambda) = \frac{1}{\sqrt n}\sum_{t=1}^{n}x_te^{-i\lambda t}, \qquad \lambda \in (-\pi, \pi].$$

In other words, the function $\sqrt{n/2\pi}\,d_x(\lambda)$ is the Fourier series corresponding to the coefficients $\ldots, 0, 0, x_1, x_2, \ldots, x_n, 0, 0, \ldots$. The inversion formula (or a short calculation) shows that

$$x_t = \frac{\sqrt n}{2\pi}\int_{-\pi}^{\pi}e^{it\lambda}d_x(\lambda)\,d\lambda, \qquad t = 1, 2, \ldots, n.$$

Thus there is a one-to-one relationship between the numbers $x_1, \ldots, x_n$ and the function $d_x$; we may view the function $d_x$ as "encoding" the numbers $x_1, \ldots, x_n$.

Encoding $n$ numbers by a function on the interval $(-\pi, \pi]$ is rather inefficient. At closer inspection the numbers $x_1, \ldots, x_n$ can also be recovered from the values of $d_x$ on the grid

$$\ldots, -\frac{4\pi}{n}, -\frac{2\pi}{n}, 0, \frac{2\pi}{n}, \frac{4\pi}{n}, \ldots \subset (-\pi, \pi].$$

These $n$ points are called the natural frequencies at time $n$.

11.1 Lemma. If $d_x$ is the finite Fourier transform of $x_1, \ldots, x_n \in \mathbb{C}$, then

$$x_t = \frac{1}{\sqrt n}\sum_jd_x(\lambda_j)e^{it\lambda_j}, \qquad t = 1, 2, \ldots, n,$$

where the sum is computed over the natural frequencies $\lambda_j \in (-\pi, \pi]$ at time $n$.

Proof. For each of the natural frequencies $\lambda_j$ define a vector

$$e_j = \frac{1}{\sqrt n}\bigl(e^{i\lambda_j}, e^{i2\lambda_j}, \ldots, e^{in\lambda_j}\bigr).$$

It is straightforward to check that the $n$ vectors $e_j$ form an orthonormal set in $\mathbb{C}^n$ and hence a basis. Thus the vector $x = (x_1, \ldots, x_n)$ can be written as $x = \sum_j\langle x, e_j\rangle e_j$. Now $\langle x, e_j\rangle = d_x(\lambda_j)$ and the lemma follows. ■

The proof of the preceding lemma shows how the numbers $d_x(\lambda_j)$ can be interpreted. View the coordinates of the vector $x = (x_1, \ldots, x_n)$ as the values of a signal at the time instants $1, 2, \ldots, n$. Similarly, view the coordinates of the vector $e_j$ as the values of the pure trigonometric signal $t \mapsto n^{-1/2}e^{it\lambda_j}$ at these time instants. By the preceding lemma the signal $x$ can be written as a linear combination of the signals $e_j$. The value $|d_x(\lambda_j)|$ is the weight of signal $e_j$, and hence of frequency $\lambda_j$, in $x$.
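The encoding and its inversion over the natural frequencies can be checked numerically. The sketch below evaluates the defining sums directly rather than by the FFT, which is slower but keeps the correspondence with the formulas transparent; the function names are hypothetical and not part of the original notes.

```python
import cmath
import math


def natural_frequencies(n):
    """The n points (2*pi/n)*j that lie in the interval (-pi, pi]."""
    return [2 * math.pi * j / n for j in range(-((n - 1) // 2), n // 2 + 1)]


def finite_fourier_transform(x, lam):
    """d_x(lam) = n**(-1/2) * sum_{t=1}^n x_t * exp(-i*lam*t)."""
    n = len(x)
    total = sum(xt * cmath.exp(-1j * lam * t) for t, xt in enumerate(x, start=1))
    return total / math.sqrt(n)


def reconstruct(d_values, freqs, t):
    """x_t = n**(-1/2) * sum_j d_x(lam_j) * exp(i*t*lam_j), as in Lemma 11.1."""
    n = len(freqs)
    total = sum(d * cmath.exp(1j * t * lam) for d, lam in zip(d_values, freqs))
    return total / math.sqrt(n)
```

For a sequence `x`, computing `d = [finite_fourier_transform(x, lam) for lam in natural_frequencies(len(x))]` and then `reconstruct(d, natural_frequencies(len(x)), t)` for $t = 1, \ldots, n$ should return the original values up to rounding error.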



† See e.g. Brockwell and Davis, Chapter 10 for a discussion.

11.2 EXERCISE. How is the weight of frequency 0 expressed in $(x_1, \ldots, x_n)$?

11.3 EXERCISE. Show that $d_{(\mu, \mu, \ldots, \mu)}(\lambda_j) = 0$ for every natural frequency $\lambda_j \ne 0$ and every $\mu \in \mathbb{C}$. Conclude that $d_{x - \bar x_n}(\lambda_j) = d_x(\lambda_j)$.



11.2 Periodogram 

The periodogram of a sequence of observations $X_1, \ldots, X_n$ is defined as the function $\lambda \mapsto I_n(\lambda)$ given by

$$I_n(\lambda) = \frac{1}{2\pi}\bigl|d_X(\lambda)\bigr|^2 = \frac{1}{2\pi n}\Bigl|\sum_{t=1}^{n}X_te^{-it\lambda}\Bigr|^2.$$

We write $I_{n,X}$ if the dependence on $X_1, \ldots, X_n$ needs to be stressed.

In view of the interpretation of the finite Fourier transform in the preceding section $I_n(\lambda)$ is the square of the weight of frequency $\lambda$ in the signal $X_1, \ldots, X_n$. The spectral density $f_X(\lambda)$ can be interpreted as the variance of the component of frequency $\lambda$ in the time series $X_t$. Thus $I_n(\lambda)$ appears to be a reasonable estimator of the spectral density. This is true to a certain extent, but not quite true. While we shall show that the expected value of $I_n(\lambda)$ converges to $f_X(\lambda)$, we shall also show that there are much better estimators than the periodogram. Because these will be derived from the periodogram, it is still of interest to study its properties.

By evaluating the square in its definition and rearranging the resulting double sum, the periodogram can be rewritten in the form (if $x_1, \ldots, x_n$ are real)

$$I_n(\lambda) = \frac{1}{2\pi}\sum_{|h|<n}\Bigl(\frac1n\sum_{t=1}^{n-|h|}X_{t+|h|}X_t\Bigr)e^{-ih\lambda}. \tag{11.2}$$

For natural frequencies $\lambda_j \ne 0$ we have that $d_{(\mu, \ldots, \mu)}(\lambda_j) = 0$ for every $\mu$, in particular for $\mu = \overline X_n$. This implies that $I_{n,X}(\lambda_j) = I_{n,X-\overline X_n}(\lambda_j)$ and hence

$$I_n(\lambda_j) = \frac{1}{2\pi}\sum_{|h|<n}\hat\gamma_n(h)e^{-ih\lambda_j}, \qquad \lambda_j \in \frac{2\pi}{n}\mathbb{Z} - \{0\}. \tag{11.3}$$

This is exactly the estimator $\hat f_{n,n}(\lambda)$ given in (11.1). As noted before, we should expect this estimator to be unreliable as an estimator of $f_X(\lambda)$, because of the imprecision of the estimators $\hat\gamma_n(h)$ for lags $|h|$ close to $n$.
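The identity (11.3) is easy to verify numerically. The sketch below assumes the usual sample auto-covariance $\hat\gamma_n(h) = n^{-1}\sum_{t=1}^{n-|h|}(X_{t+|h|} - \overline X_n)(X_t - \overline X_n)$; this definition and all function names are assumptions made for the illustration.

```python
import cmath
import math


def periodogram(x, lam):
    """I_n(lam) = |sum_{t=1}^n x_t * exp(-i*t*lam)|**2 / (2*pi*n)."""
    n = len(x)
    s = sum(xt * cmath.exp(-1j * lam * t) for t, xt in enumerate(x, start=1))
    return abs(s) ** 2 / (2 * math.pi * n)


def sample_autocov(x, h):
    """Sample auto-covariance at lag h, centred at the sample mean."""
    n, h = len(x), abs(h)
    mean = sum(x) / n
    return sum((x[t + h] - mean) * (x[t] - mean) for t in range(n - h)) / n


def truncated_estimator(x, lam, r):
    """The estimator (11.1): (1/(2*pi)) * sum over |h| < r of gamma_n(h) * exp(-i*h*lam)."""
    s = sum(sample_autocov(x, h) * cmath.exp(-1j * h * lam) for h in range(-r + 1, r))
    return s.real / (2 * math.pi)
```

At a nonzero natural frequency, `periodogram(x, lam)` and `truncated_estimator(x, lam, len(x))` should agree up to rounding error. At frequency 0 they generally differ, because the centring at the sample mean removes the weight of frequency 0.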
Assuming that the time series $X_t$ is stationary with mean $\mu = \mathrm{E}X_t$, we can compute the mean of the periodogram, for $\lambda \ne 0$, as

$$\mathrm{E}I_n(\lambda) = \frac{1}{2\pi}\mathrm{E}|d_X(\lambda)|^2 = \frac{1}{2\pi}\operatorname{var}d_X(\lambda) + \frac{1}{2\pi}\bigl|\mathrm{E}d_X(\lambda)\bigr|^2$$

$$= \frac{1}{2\pi n}\sum_{s=1}^{n}\sum_{t=1}^{n}\operatorname{cov}(X_s, X_t)e^{-i\lambda(s-t)} + \frac{1}{2\pi n}\Bigl|\sum_{t=1}^{n}\mu e^{-i\lambda t}\Bigr|^2$$

$$= \frac{1}{2\pi}\sum_{|h|<n}\Bigl(1 - \frac{|h|}{n}\Bigr)\gamma_X(h)e^{-i\lambda h} + \frac{\mu^2}{2\pi n}\Bigl|\frac{1 - e^{-i\lambda n}}{1 - e^{-i\lambda}}\Bigr|^2.$$

The second term on the far right is of the order $O(1/n)$ for every $\lambda \ne 0$ and even vanishes for every natural frequency $\lambda_j \ne 0$. Under the condition that $\sum_h|\gamma_X(h)| < \infty$, the first term converges to $f_X(\lambda)$ as $n \to \infty$, by the dominated convergence theorem. We conclude that the periodogram is asymptotically unbiased for estimating the spectral density in that $\mathrm{E}I_n(\lambda) \to f_X(\lambda)$. This is a good property. Unfortunately, the periodogram is not a consistent estimator for $f_X(\lambda)$: the following theorem shows that $I_n(\lambda)$ is asymptotically exponentially distributed with mean $f_X(\lambda)$, whence we do not have that $I_n(\lambda) \stackrel{P}{\to} f_X(\lambda)$. Using the periodogram as an estimator of $f_X(\lambda)$ is, for $n \to \infty$, equivalent to estimating $f_X(\lambda)$ based on one observation with an exponential distribution. This is disappointing, because we should hope that after observing the time series $X_t$ long enough, we would be able to estimate its spectral density with arbitrary precision. The periodogram does not fulfill this hope, as it keeps fluctuating around the target value $f_X(\lambda)$. Apparently, it does not effectively use the information available in the observations $X_1, \ldots, X_n$.

11.4 Theorem. Let $X_t = \sum_j\psi_jZ_{t-j}$ for an i.i.d. sequence $Z_t$ with mean zero and finite second moment and coefficients $\psi_j$ with $\sum_j|\psi_j| < \infty$. Then for any values $0 < \mu_1 < \cdots < \mu_k < \pi$ the variables $I_n(\mu_1), \ldots, I_n(\mu_k)$ are asymptotically distributed as independent exponential variables with means $f_X(\mu_1), \ldots, f_X(\mu_k)$, respectively.

Proof. First consider the case that $X_t = Z_t$ for every $t$. Then the spectral density $f_X(\lambda)$ is the function $f_Z(\lambda) = \sigma^2/2\pi$, for $\sigma^2$ the variance of the white noise sequence. We can write

$$d_Z(\lambda) = \frac{1}{\sqrt n}\sum_{t=1}^{n}Z_t\cos(\lambda t) - i\frac{1}{\sqrt n}\sum_{t=1}^{n}Z_t\sin(\lambda t) =: A_n(\lambda) - iB_n(\lambda).$$

By straightforward calculus we find that, for any $\lambda, \mu \in (0, \pi)$,

$$\operatorname{cov}\bigl(A_n(\lambda), A_n(\mu)\bigr) = \frac{\sigma^2}{n}\sum_{t=1}^{n}\cos(\lambda t)\cos(\mu t) \to \begin{cases}\sigma^2/2 & \text{if } \lambda = \mu,\\ 0 & \text{if } \lambda \ne \mu,\end{cases}$$

$$\operatorname{cov}\bigl(B_n(\lambda), B_n(\mu)\bigr) = \frac{\sigma^2}{n}\sum_{t=1}^{n}\sin(\lambda t)\sin(\mu t) \to \begin{cases}\sigma^2/2 & \text{if } \lambda = \mu,\\ 0 & \text{if } \lambda \ne \mu,\end{cases}$$

$$\operatorname{cov}\bigl(A_n(\lambda), B_n(\mu)\bigr) = \frac{\sigma^2}{n}\sum_{t=1}^{n}\cos(\lambda t)\sin(\mu t) \to 0.$$

By the Lindeberg central limit theorem, Theorem 3.15, we now find that, for $\lambda \ne \mu$, the sequence of vectors $(A_n(\lambda), B_n(\lambda), A_n(\mu), B_n(\mu))$ converges in distribution to a vector $(G_1, G_2, G_3, G_4)$ with the $N_4(0, (\sigma^2/2)I)$ distribution. Consequently, by the continuous mapping theorem,

$$\bigl(I_n(\lambda), I_n(\mu)\bigr) = \frac{1}{2\pi}\bigl(A_n^2(\lambda) + B_n^2(\lambda), A_n^2(\mu) + B_n^2(\mu)\bigr) \rightsquigarrow \frac{1}{2\pi}\bigl(G_1^2 + G_2^2, G_3^2 + G_4^2\bigr).$$

The vector on the right is distributed as $\sigma^2/(4\pi)$ times a vector of two independent $\chi^2_2$ variables. Because the chisquare distribution with two degrees of freedom is identical to the exponential distribution with parameter 1/2 (and hence mean 2), this is the same as a vector of two independent exponential variables with means $\sigma^2/(2\pi)$.

This concludes the proof in the special case that $X_t = Z_t$ and for two frequencies $\lambda$ and $\mu$. The case of $k$ different frequencies $\mu_1, \ldots, \mu_k$ can be treated in exactly the same way, but is notationally more involved.

Now consider the case of a general time series of the form $X_t = \sum_j\psi_jZ_{t-j}$. Then $f_X(\lambda) = |\psi(\lambda)|^2f_Z(\lambda)$, for $\psi(\lambda) = \sum_j\psi_je^{-ij\lambda}$ the transfer function of the linear filter. We shall prove the theorem by showing that the periodograms $I_{n,X}$ and $I_{n,Z}$ satisfy a similar relation, approximately. Indeed, rearranging sums we find

$$d_X(\lambda) = \frac{1}{\sqrt n}\sum_{t=1}^{n}\Bigl(\sum_j\psi_jZ_{t-j}\Bigr)e^{-it\lambda} = \sum_j\psi_je^{-ij\lambda}\Bigl(\frac{1}{\sqrt n}\sum_{s=1-j}^{n-j}Z_se^{-is\lambda}\Bigr).$$

If we replace the sum $\sum_{s=1-j}^{n-j}$ in the right side by the sum $\sum_{s=1}^{n}$, then the right side of the display becomes $\psi(\lambda)d_Z(\lambda)$. These two sums differ by $2(|j| \wedge n)$ terms, each of the terms $Z_se^{-i\lambda s}$ having mean zero and variance bounded by $\sigma^2$, and the terms being independent. Thus

$$\mathrm{E}\Bigl|\frac{1}{\sqrt n}\sum_{s=1-j}^{n-j}Z_se^{-is\lambda} - d_Z(\lambda)\Bigr|^2 \le 2\,\frac{|j| \wedge n}{n}\,\sigma^2.$$

In view of the inequality $\mathrm{E}|X| \le (\mathrm{E}X^2)^{1/2}$, we can drop the square on the left side if we take a root on the right side. Next combining the two preceding displays and applying the triangle inequality, we find

$$\mathrm{E}\bigl|d_X(\lambda) - \psi(\lambda)d_Z(\lambda)\bigr| \le \sum_j|\psi_j|\Bigl(2\,\frac{|j| \wedge n}{n}\Bigr)^{1/2}\sigma.$$

The $j$th term of the series is bounded by $|\psi_j|(2|j|/n)^{1/2}\sigma$ and hence converges to zero as $n \to \infty$, for every fixed $j$; it is also dominated by $|\psi_j|\sqrt2\,\sigma$. Therefore, the right side of the preceding display converges to zero as $n \to \infty$.

By Markov's and Slutsky's lemmas it follows that $d_X(\lambda)$ has the same limit distribution as $\psi(\lambda)d_Z(\lambda)$. By the continuous mapping theorem $I_{n,X}(\lambda)$ has the same limit distribution as $|\psi(\lambda)|^2I_{n,Z}(\lambda)$. This is true for every fixed $\lambda$, but also for finite sets of $\lambda$ jointly. The proof is finished, because the variables $|\psi(\lambda)|^2I_{n,Z}(\lambda)$ are asymptotically distributed as independent exponential variables with means $|\psi(\lambda)|^2f_Z(\lambda)$, by the first part of the proof. ■

A remarkable aspect of the preceding theorem is that the periodogram values $I_n(\lambda)$ at different frequencies are asymptotically independent. This is well visible already for finite values of $n$ in plots of the periodogram, which typically have a wild and peaky appearance. The theorem says that for large $n$ such a plot should be similar to a plot of independent exponentially distributed variables $E_\lambda$ with means $f_X(\lambda)$ (on the y-axis) versus $\lambda$ (on the x-axis).
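A small simulation illustrates the lack of consistency. For Gaussian white noise with unit variance the periodogram at a fixed natural frequency should have mean close to $1/(2\pi)$ and variance close to the square of its mean, as for an exponential variable, whatever the value of $n$. The code is a sketch with hypothetical function names and simulation settings, not part of the original notes.

```python
import cmath
import math
import random


def periodogram(x, lam):
    """I_n(lam) = |sum_{t=1}^n x_t * exp(-i*t*lam)|**2 / (2*pi*n)."""
    n = len(x)
    s = sum(xt * cmath.exp(-1j * lam * t) for t, xt in enumerate(x, start=1))
    return abs(s) ** 2 / (2 * math.pi * n)


def simulate_periodogram_values(n, lam, reps, seed=0):
    """Periodogram at frequency lam for reps independent Gaussian white noise series."""
    rng = random.Random(seed)
    return [periodogram([rng.gauss(0.0, 1.0) for _ in range(n)], lam)
            for _ in range(reps)]


def mean_and_variance(values):
    """Sample mean and (unbiased) sample variance."""
    m = sum(values) / len(values)
    v = sum((u - m) ** 2 for u in values) / (len(values) - 1)
    return m, v
```

Comparing `mean_and_variance(simulate_periodogram_values(16, 3 * math.pi / 8, 1000))` with the same call for $n = 128$ should show no noticeable decrease in the variance as $n$ grows.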




Figure 11.1. Periodogram of a realization of the moving average $X_t = 0.5Z_t + 0.2Z_{t-1} + 0.5Z_{t-2}$ for a Gaussian white noise series. (Vertical scale in decibel, i.e. log.)

The following theorem shows that we even have independence of the periodogram 
values at natural frequencies that converge to the same value. 

11.5 Theorem. Let $X_t = \sum_j\psi_jZ_{t-j}$ for an i.i.d. sequence $Z_t$ with finite second moments and coefficients $\psi_j$ with $\sum_j|\psi_j| < \infty$. Let $\lambda_n = (2\pi/n)j_n$ for $j_n \in \mathbb{Z}$ be a sequence of natural frequencies such that $\lambda_n \to \lambda \in (0, \pi)$. Then for any $k \in \mathbb{Z}$ the variables $I_n(\lambda_n - k2\pi/n), I_n(\lambda_n - (k-1)2\pi/n), \ldots, I_n(\lambda_n + k2\pi/n)$ are asymptotically distributed as independent exponential variables with mean $f_X(\lambda)$.
Proof. The second part of the proof of Theorem 11.4 is valid uniformly in $\lambda$ and hence applies to sequences of frequencies $\lambda_n$. For instance, the continuity of $\psi(\lambda)$ and the proof shows that $|d_X(\mu_n) - \psi(\mu_n)d_Z(\mu_n)| \stackrel{P}{\to} 0$ for any sequence $\mu_n$. It suffices to extend the first part of the proof, which concerns the special case that $X_t = Z_t$.

Here we apply the same method as in the proof of Theorem 11.4. The limits of the covariances are as before, where in the present case we use the fact that we are considering natural frequencies only. For instance,

$$\operatorname{cov}\Bigl(A_n\Bigl(\frac{2\pi k}{n}\Bigr), A_n\Bigl(\frac{2\pi l}{n}\Bigr)\Bigr) = \frac{\sigma^2}{n}\sum_{t=1}^{n}\cos\Bigl(\frac{2\pi kt}{n}\Bigr)\cos\Bigl(\frac{2\pi lt}{n}\Bigr) = 0,$$

for all integers $k, l$ such that $(k+l)/n$ and $(k-l)/n$ are not contained in $\mathbb{Z}$. An application of the Lindeberg central limit theorem concludes the proof. ■

The sequences of frequencies $\lambda_n + j(2\pi/n)$ considered in the preceding theorem all converge to the same value $\lambda$. That Theorem 11.4 remains valid (it does) if we replace the fixed frequencies $\mu_j$ in this theorem by sequences $\mu_{j,n} \to \mu_j$ is not very surprising. More surprising is the asymptotic independence of the periodograms $I_n(\mu_{n,j})$ at different frequencies $\mu_{n,j}$ even if every sequence $\mu_{n,j}$ converges to the same frequency $\lambda$. As the proof of the preceding theorem shows, this depends crucially on using natural frequencies.

The remarkable independence of the periodogram at frequencies that are very close together is a further explanation of the peaky appearance of the periodogram $I_n(\lambda)$ as a function of $\lambda$. It is clear that this function is not a good estimator of the spectral density. However, the independence suggests ways of improving our estimator for $f_X(\lambda)$. The values $I_n(\lambda_n - k2\pi/n), I_n(\lambda_n - (k-1)2\pi/n), \ldots, I_n(\lambda_n + k2\pi/n)$ can be viewed as a sample of independent estimators of $f_X(\lambda)$, for any $k$. Rather than one exponentially distributed variable, we therefore have many exponentially distributed variables, all with the same (asymptotic) mean. We exploit this in the next section.

In practice the periodogram may have one or a few extremely high peaks that 
completely dominate its graph. This indicates an important cyclic component in the time 
series at those frequencies. Cyclic components of smaller amplitude at other frequencies 
may be hidden. It is practical wisdom that in such a case a fruitful spectral analysis at 
other frequencies requires that the peak frequencies are first removed from the signal (by 
a filter with the appropriate transfer function). We next estimate the spectrum of the 
new time series and, if desired, transform this back to obtain the spectrum of the original 
series, using the formula given in Theorem 6.9. Because a spectrum without high peaks 
is similar to the uniform spectrum of a white noise series, this procedure is known as 
prewhitening of the data. 



11.3 Estimating a Spectral Density 

Given $\lambda \in (0,\pi)$ and $n$, let $\lambda_n$ be the natural frequency closest to $\lambda$. Then $\lambda_n \to \lambda$ as
$n \to \infty$ and Theorem 11.5 shows that for any $k \in \mathbb{Z}$ the variables $I_n(\lambda_n + j2\pi/n)$ for
$j = -k, \ldots, k$ are asymptotically distributed as independent exponential variables with
mean $f_X(\lambda)$. This suggests estimating $f_X(\lambda)$ by the average

$$\hat f_k(\lambda) = \frac{1}{2k+1} \sum_{|j| \le k} I_n\Bigl(\lambda_n + \frac{2\pi}{n} j\Bigr). \tag{11.4}$$

As a consequence of Theorem 11.5, the variables $(2k+1)\hat f_k(\lambda)$ are asymptotically
distributed according to the gamma distribution with shape parameter $2k+1$ and mean
$(2k+1) f_X(\lambda)$. This suggests a confidence interval for $f_X(\lambda)$ of the form, with $\chi^2_{k,\alpha}$
the upper $\alpha$-quantile of the chi-square distribution with $k$ degrees of freedom,

$$\Bigl( \frac{(4k+2)\hat f_k(\lambda)}{\chi^2_{4k+2,\alpha}},\ \frac{(4k+2)\hat f_k(\lambda)}{\chi^2_{4k+2,1-\alpha}} \Bigr).$$

11.6 EXERCISE. Show that, for every fixed $k$, this interval is asymptotically of level
$1 - 2\alpha$.
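
The averaging estimator (11.4) and the interval above can be sketched in a few lines of
Python. This is a minimal illustration, not part of the original notes: the function names
are hypothetical, the periodogram is computed by a direct (slow) Fourier sum, and the
gamma quantiles are obtained by simulation rather than from chi-square tables. It assumes
that $\lambda$ is well inside $(0,\pi)$ and that $k$ is small relative to $n$.

```python
import cmath
import math
import random


def periodogram(x, lam):
    """I_n(lam) = |sum_t x_t e^{-i t lam}|^2 / (2 pi n)."""
    n = len(x)
    d = sum(xt * cmath.exp(-1j * lam * t) for t, xt in enumerate(x, start=1))
    return abs(d) ** 2 / (2 * math.pi * n)


def averaged_periodogram(x, lam, k):
    """Estimator (11.4): mean of I_n over the 2k+1 natural frequencies around lam."""
    n = len(x)
    j0 = round(lam * n / (2 * math.pi))  # index of the natural frequency closest to lam
    values = [periodogram(x, 2 * math.pi * (j0 + j) / n) for j in range(-k, k + 1)]
    return sum(values) / (2 * k + 1)


def gamma_interval(f_hat, k, alpha, draws=20000, seed=0):
    """Interval for f_X(lam) based on simulated Gamma(2k+1, 1) quantiles.

    Uses that (2k+1) * f_hat / f_X(lam) is approximately Gamma(2k+1, 1),
    which is the same statement as the chi-square interval with 4k+2 degrees of freedom.
    """
    rng = random.Random(seed)
    shape = 2 * k + 1
    sample = sorted(rng.gammavariate(shape, 1.0) for _ in range(draws))
    lower_q = sample[int(alpha * draws)]            # empirical alpha-quantile
    upper_q = sample[int((1 - alpha) * draws) - 1]  # empirical (1 - alpha)-quantile
    return shape * f_hat / upper_q, shape * f_hat / lower_q


if __name__ == "__main__":
    rng = random.Random(42)
    noise = [rng.gauss(0.0, 1.0) for _ in range(256)]  # white noise, f_X = 1 / (2 pi)
    f_hat = averaged_periodogram(noise, 1.0, k=5)
    print(f_hat, gamma_interval(f_hat, k=5, alpha=0.05), 1 / (2 * math.pi))
```

For a white noise series with unit variance the estimate should be near $1/(2\pi) \approx 0.159$;
increasing $k$ shortens the interval at the price of averaging over a wider frequency band.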

Instead of a simple average we may prefer a weighted average. For given weights $W_j$
such that $\sum_j W_j = 1$, we use

$$\hat f_k(\lambda) = \sum_j W_j I_n\Bigl(\lambda_n + \frac{2\pi}{n} j\Bigr). \tag{11.5}$$

This allows us to give greater weight to frequencies $\lambda_n + (2\pi/n)j$ that are closer to $\lambda$. A
disadvantage is that the asymptotic distribution is relatively complicated: it is a weighted
sum of independent exponential variables. Because tabulating these types of distributions
is complicated, one often approximates it by a scaled chi-square distribution, where the
scaling and the degrees of freedom are chosen to match the first two moments: the
estimator $c^{-1}\hat f_k(\lambda)$ is approximately $\chi^2_\nu$ distributed for $c$ and $\nu$ solving the equations

$$\text{asymptotic mean of } \hat f_k(\lambda) = f_X(\lambda) = c\nu,$$

$$\text{asymptotic variance of } \hat f_k(\lambda) = \sum_j W_j^2 f_X^2(\lambda) = 2c^2\nu.$$

This yields $c$ proportional to $f_X(\lambda)$ and $\nu$ independent of $f_X(\lambda)$, and thus confidence
intervals based on this approximation can be derived as before. Rather than using the
approximation, we could of course determine the desired quantiles by computer simulation.
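
Solving the two moment equations explicitly gives $\nu = 2/\sum_j W_j^2$ and
$c = f_X(\lambda)\sum_j W_j^2/2$. The short sketch below (the function name is hypothetical and
only meant as an illustration) computes these quantities; for the uniform weights
$W_j = 1/(2k+1)$ it recovers the $4k+2$ degrees of freedom used for the simple average.

```python
def chisquare_match(weights, f_value):
    """Return (c, nu) such that c * nu = f and 2 * c**2 * nu = f**2 * sum(W_j**2)."""
    s2 = sum(w * w for w in weights)
    nu = 2.0 / s2            # equivalent degrees of freedom, free of f
    c = f_value * s2 / 2.0   # scale factor, proportional to f
    return c, nu


if __name__ == "__main__":
    k = 3
    uniform = [1.0 / (2 * k + 1)] * (2 * k + 1)
    print(chisquare_match(uniform, 1.0))             # nu is 4k + 2 = 14 up to rounding
    print(chisquare_match([0.25, 0.5, 0.25], 1.0))   # fewer degrees of freedom
```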

Because the periodogram is continuous as a function of $\lambda$, a discrete average (over
natural frequencies) can be closely approximated by a continuous average of the form

$$\hat f_W(\lambda) = \int W(\omega)\, I_{n,X-\bar X}(\lambda - \omega)\,d\omega. \tag{11.6}$$

Figure 11.2. Smoothed periodogram of a realization of the moving average
$X_t = 0.5Z_t + 0.2Z_{t-1} + 0.5Z_{t-2}$ for a Gaussian white noise series. (Vertical scale in decibel, i.e. log.)

Here the weight function $W$ is to satisfy $\int W(\omega)\,d\omega = 1$ and would typically concentrate
its mass around zero, so that the average is computed over $I_{n,X-\bar X}(\omega)$ for $\omega \approx \lambda$. We use
the periodogram of the centered series $X - \bar X$, because the average involves nonnatural
frequencies. In view of (11.3) this estimator can be written in the form

$$\hat f_W(\lambda) = \frac{1}{2\pi} \sum_{|h|<n} w(h)\,\hat\gamma_n(h)\, e^{-ih\lambda},$$

where $w(h) = \int e^{i\omega h} W(\omega)\,d\omega$ are the Fourier coefficients of the weight function. Thus
we have arrived at a generalization of the estimator (11.1). If we choose $w(h) = 1$ for
$|h| < r$ and $w(h) = 0$ otherwise, then the preceding display exactly gives (11.1). The more
general form can be motivated by the same reasoning: the role of the coefficients $w(h)$ is
to diminish the influence of the relatively unreliable estimators $\hat\gamma_n(h)$ (for $h \approx n$), when
plugging in these sample estimators for the true auto-covariances in the expression for
the spectral density. Thus, the weights $w(h)$ are typically chosen to decrease in absolute
value from $|w(0)| = 1$ to $|w(n)| = 0$ if $h$ increases from $0$ to $n$.

The function $W$ is known as the spectral window; its Fourier coefficients $w(h)$ are
known as the lag window, tapering function or convergence factors. The last name comes
from Fourier analysis, where convergence factors were introduced to improve the
approximation properties of a Fourier series: it was noted that for suitably chosen weights
$w(h)$ the partial sums $\sum_{|h|<n} w(h)\gamma_X(h)e^{-ih\lambda}$ could be much closer to the full series
$\sum_h \gamma_X(h)e^{-ih\lambda}$ than the same partial sums with $w = 1$. In our statistical context this is
even more so the case, because we introduce additional approximation error by replacing
the coefficients $\gamma_X(h)$ by the estimators $\hat\gamma_n(h)$.
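
The lag-window form of the estimator is straightforward to compute directly. The sketch
below is an illustration only: the function names are hypothetical, the sample
auto-covariances use the centered series with divisor $n$, and the triangular window is an
added example of weights that decrease from 1 to 0 (it is not discussed in the text).

```python
import cmath
import math


def sample_autocov(x, h):
    """Sample auto-covariance at lag h of the centered series (divisor n)."""
    n = len(x)
    mean = sum(x) / n
    h = abs(h)
    return sum((x[t + h] - mean) * (x[t] - mean) for t in range(n - h)) / n


def lag_window_estimate(x, lam, w):
    """(1 / 2 pi) * sum over |h| < n of w(h) * gamma_hat(h) * e^{-i h lam}."""
    n = len(x)
    total = sum(w(h) * sample_autocov(x, h) * cmath.exp(-1j * h * lam)
                for h in range(-(n - 1), n))
    return total.real / (2 * math.pi)  # imaginary part cancels for symmetric w


def truncated_window(r):
    """w(h) = 1 for |h| < r and 0 otherwise; this choice gives the estimator (11.1)."""
    return lambda h: 1.0 if abs(h) < r else 0.0


def triangular_window(r):
    """Illustrative weights decreasing linearly from 1 at h = 0 to 0 at |h| = r."""
    return lambda h: max(0.0, 1.0 - abs(h) / r)


if __name__ == "__main__":
    series = [0.3, -1.2, 0.7, 2.0, -0.5, 0.1, -0.9, 1.4, 0.2, -0.6]
    for window in (truncated_window(3), triangular_window(3)):
        print(lag_window_estimate(series, 1.0, window))
```

With $w \equiv 1$ the function returns the periodogram of the centered series, which is a
convenient sanity check of the implementation.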

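11.7 Example. The tapering function

$$w(h) = \begin{cases} 1 & \text{if } |h| \le r, \\ 0 & \text{if } |h| > r, \end{cases}$$

corresponds to the Dirichlet kernel

$$W(\lambda) = \frac{1}{2\pi} \sum_{|h| \le r} e^{-ih\lambda} = \frac{1}{2\pi}\, \frac{\sin\bigl((r+\tfrac12)\lambda\bigr)}{\sin(\tfrac12\lambda)}.$$

Therefore, the estimator (11.1) should be compared to the estimators (11.5) and (11.6)
with weights $W_j$ chosen according to the Dirichlet kernel. □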

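11.8 Example. The uniform kernel

$$W(\lambda) = \begin{cases} r/(2\pi) & \text{if } |\lambda| \le \pi/r, \\ 0 & \text{if } |\lambda| > \pi/r, \end{cases}$$

corresponds to the weight function $w(h) = r\sin(\pi h/r)/(\pi h)$. These choices of spectral and
lag windows correspond to the estimator (11.4). □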

All estimators for the spectral density considered so far can be viewed as smoothed
periodograms: the value $\hat f(\lambda)$ of the estimator at $\lambda$ is an average or weighted average of
values $I_n(\mu)$ for $\mu$ in a neighbourhood of $\lambda$. Thus "irregularities" in the periodogram
are "smoothed out". The amount of smoothing is crucial for the accuracy of the estimators.
This amount, called the bandwidth, is determined by the parameter $k$ in (11.4), the
weights $W_j$ in (11.5), the kernel $W$ in (11.6), and, more hidden, by the parameter $r$ in
(11.1). For instance, a large value of $k$ in (11.4) or a kernel $W$ with a large variance in
(11.6) results in a large amount of smoothing (large bandwidth). Oversmoothing, choosing
a bandwidth that is too large, results in spectral estimators that are too flat and
therefore inaccurate, whereas undersmoothing, choosing too small a bandwidth, yields
spectral estimators that share the bad properties of the periodogram. In practice an
"optimal" bandwidth is often determined by plotting the spectral estimators for a number of
different bandwidths and next choosing the one that looks "reasonable". An alternative
is to use one of several methods of "data-driven" choices of bandwidths, such as cross
validation or penalization. We omit a discussion.

Theoretical analysis of the choice of the bandwidth is almost exclusively asymptotic
in nature. Given a number of observations tending to infinity, the "optimal"
bandwidth decreases to zero. A main concern of an asymptotic analysis is to determine
the rate at which the bandwidth should decrease as the number of observations tends
to infinity. The key concept is the bias-variance trade-off. Because the periodogram is
more or less unbiased, little smoothing gives an estimator with small bias. However, as
we have seen, the estimator will have a large variance. Much smoothing has the opposite
effects. Because accurate estimation requires that both bias and variance are small, we
need an intermediate value of the bandwidth.

We shall quantify this bias-variance trade-off for estimators of the type (11.1), where
we consider $r$ as the bandwidth parameter. As our objective we take to minimize the
mean integrated square error

$$2\pi\,\mathrm{E}\int \bigl|\hat f_{n,r}(\lambda) - f_X(\lambda)\bigr|^2\,d\lambda.$$

The integrated square error is a global measure of the discrepancy between $\hat f_{n,r}$ and
$f_X$. Because we are interested in $f_X$ as a function, it is more relevant than the distance
$|\hat f_{n,r}(\lambda) - f_X(\lambda)|$ for any fixed $\lambda$.

We shall use Parseval's identity, which says that the space $L_2(-\pi,\pi]$ is isometric to
the space $\ell_2$.

11.9 Lemma (Parseval's identity). Let $f\colon (-\pi,\pi] \to \mathbb{C}$ be a measurable function such
that $\int |f|^2(\lambda)\,d\lambda < \infty$. Then its Fourier coefficients $f_j = \int_{-\pi}^{\pi} e^{ij\lambda} f(\lambda)\,d\lambda$ satisfy

$$\int_{-\pi}^{\pi} |f(\lambda)|^2\,d\lambda = \frac{1}{2\pi} \sum_{j=-\infty}^{\infty} |f_j|^2.$$

11.10 EXERCISE. Prove this identity. Also show that for a pair of square-integrable,
measurable functions $f, g\colon (-\pi,\pi] \to \mathbb{C}$ we have $\int f(\lambda)\overline{g(\lambda)}\,d\lambda = (2\pi)^{-1}\sum_j f_j \overline{g_j}$.

The function $\hat f_{n,r} - f_X$ possesses the Fourier coefficients $\hat\gamma_n(h) - \gamma_X(h)$ for $|h| < r$
and $-\gamma_X(h)$ for $|h| \ge r$. Thus, Parseval's identity yields that the preceding display is
equal to

$$\mathrm{E}\sum_{|h|<r} \bigl|\hat\gamma_n(h) - \gamma_X(h)\bigr|^2 + \sum_{|h|\ge r} |\gamma_X(h)|^2.$$

In a rough sense the two terms in this formula are the "variance" and the "bias" term. A
large value of $r$ clearly decreases the second, bias term, but increases the first, variance
term. This variance term can itself be split into a bias and variance term and we can
reexpress the mean integrated square error as

$$\sum_{|h|<r} \operatorname{var}\hat\gamma_n(h) + \sum_{|h|<r} \bigl|\mathrm{E}\hat\gamma_n(h) - \gamma_X(h)\bigr|^2 + \sum_{|h|\ge r} |\gamma_X(h)|^2.$$

Assume for simplicity that $\mathrm{E}X_t = 0$ and that $X_t = \sum_j \psi_j Z_{t-j}$ for an absolutely
converging series $\sum_j \psi_j$ and i.i.d. sequence $Z_t$ with finite fourth moments. Furthermore, assume
that we use the estimator $\hat\gamma_n(h) = n^{-1}\sum_{t=1}^{n-h} X_{t+h}X_t$ rather than the true sample
auto-covariance function. (The results for the general case are similar, but the calculations
will be even more involved than they already are. Note that the difference between the
present estimator and the usual one is approximately $\bar X^2$ and $\sum_{|h|<r} \mathrm{E}\bar X^4 = O(r/n^2)$.
This is negligible in the following.) Then the calculations in Chapter 5 show that the
preceding display is equal to

$$\begin{aligned}
&\sum_{|h|<r} \frac1n \sum_{|g|<n-|h|} \frac{n-|h|-|g|}{n} \Bigl[\kappa_4\sigma^4 \sum_i \psi_i\psi_{i+h}\psi_{i+g}\psi_{i+g+h} + \gamma_X^2(g) + \gamma_X(g+h)\gamma_X(g-h)\Bigr] \\
&\qquad\qquad + \sum_{|h|<r} \Bigl(\frac{|h|}{n}\gamma_X(h)\Bigr)^2 + \sum_{|h|\ge r} \gamma_X^2(h) \\
&\quad \le \frac{|\kappa_4|\sigma^4}{n} \sum_h\sum_g\sum_i \bigl|\psi_i\psi_{i+h}\psi_{i+g}\psi_{i+g+h}\bigr| + \frac{2r}{n} \sum_g \gamma_X^2(g) \\
&\qquad\qquad + \frac1n \sum_h\sum_g \bigl|\gamma_X(g+h)\gamma_X(g-h)\bigr| + \frac{r^2}{n^2} \sum_h \gamma_X^2(h) + \sum_{|h|\ge r} \gamma_X^2(h).
\end{aligned}$$

To ensure that the last term on the right converges to zero as $n \to \infty$ we must choose
$r = r_n \to \infty$. Then the second term on the right converges to zero if and only if $r_n/n \to 0$.
The first and third term are of the order $O(1/n)$, and the fourth term is of the order
$O(r_n^2/n^2)$. Under the requirements $r_n \to \infty$ and $r_n/n \to 0$ these terms are dominated
by the other terms, and the whole expression is of the order

$$\frac{r_n}{n} + \sum_{|h|\ge r_n} \gamma_X^2(h).$$

A first conclusion is that the sequence of estimators $\hat f_{n,r_n}$ is asymptotically consistent
for estimating $f_X$ relative to the $L_2$-distance whenever $r_n \to \infty$ and $r_n/n \to 0$. A wide
range of sequences $r_n$ satisfies these constraints. For an optimal choice we must make
assumptions regarding the rate at which the bias term $\sum_{|h|\ge r} \gamma_X^2(h)$ converges to zero
as $r \to \infty$. For any constant $m$ we have that

$$\frac{r_n}{n} + \sum_{|h|\ge r_n} \gamma_X^2(h) \le \frac{r_n}{n} + \frac{1}{r_n^{2m}} \sum_h \gamma_X^2(h)\, h^{2m}.$$

Suppose that the series on the far right converges; this means roughly that the
auto-covariances $\gamma_X(h)$ decrease faster than $|h|^{-m-1/2}$ as $|h| \to \infty$. Then we can make a
bias-variance trade-off by balancing the terms $r_n/n$ and $1/r_n^{2m}$. These terms are of equal
order for $r_n = n^{1/(2m+1)}$; for this choice of $r_n$ we find that

$$\mathrm{E}\int \bigl|\hat f_{n,r_n}(\lambda) - f_X(\lambda)\bigr|^2\,d\lambda = O\bigl(n^{-2m/(2m+1)}\bigr).$$

Large values of $m$ yield the fastest rates of convergence. The rate $n^{-m/(2m+1)}$ is always
slower than $n^{-1/2}$, the rate obtained when using parametric spectral estimators, but
approaches this rate as $m \to \infty$.

Unfortunately, in practice we do not know $\gamma_X(h)$ and therefore cannot check whether
the preceding derivation is valid. So-called cross-validation techniques may be used to
determine a suitable constant $m$ from the data.
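
As a worked illustration of the balancing step (an added calculation, with $\asymp$ denoting
"of the same order"), equating the two competing terms gives the bandwidth and the
resulting rate, here spelled out for $m = 1$ and $m = 2$:

```latex
\frac{r_n}{n} \asymp \frac{1}{r_n^{2m}}
\iff r_n^{2m+1} \asymp n
\iff r_n \asymp n^{1/(2m+1)},
\qquad
\frac{r_n}{n} \asymp n^{1/(2m+1) - 1} = n^{-2m/(2m+1)}.
% m = 1:  r_n \asymp n^{1/3},  error of order n^{-2/3}  (n = 1000 gives r_n \approx 10)
% m = 2:  r_n \asymp n^{1/5},  error of order n^{-4/5}
```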

The condition that $\sum_h \gamma_X^2(h)\,h^{2m} < \infty$ can be interpreted in terms of the
smoothness of the spectral density. By differentiating the series $f_X(\lambda) = (2\pi)^{-1}\sum_h \gamma_X(h)e^{-ih\lambda}$
repeatedly we obtain that the $m$th derivative of $f_X$ is given by

$$f_X^{(m)}(\lambda) = \frac{1}{2\pi} \sum_h (-ih)^m \gamma_X(h)\, e^{-ih\lambda}.$$

This shows that the numbers $\gamma_X(h)(-ih)^m$ are the Fourier coefficients of $f_X^{(m)}$.
Consequently, by Parseval's identity

$$\sum_h \gamma_X^2(h)\, h^{2m} = 2\pi \int_{-\pi}^{\pi} \bigl|f_X^{(m)}(\lambda)\bigr|^2\,d\lambda.$$

Thus the left side is finite if and only if the $m$th derivative of $f_X$ exists and is
square-integrable. We say that $f_X$ is $m$-smooth. For time series with an $m$-smooth spectral
density, one can estimate the spectral density with an integrated square error of order
$O(n^{-2m/(2m+1)})$. This rate is uniform over the set of all time series with spectral densities
such that $\int |f_X^{(m)}(\lambda)|^2\,d\lambda$ is uniformly bounded.

This conclusion is similar to the conclusion in the problem of estimating a density
given a random sample from this density, where also $m$-smooth densities can be estimated
with an integrated square error of order $O(n^{-2m/(2m+1)})$. The smoothing methods
discussed previously (the estimator (11.6) in particular) are also related to the method of
kernel smoothing for density estimation. It is interesting that historically the method of
smoothing was first applied to the problem of estimating a spectral density. Here kernel
smoothing of the periodogram was a natural extension of taking simple averages as in
(11.4), which itself is motivated by the independence property of the periodogram. The
method of kernel smoothing for the problem of density estimation based on a random
sample from this density was invented later, even though this problem by itself appears
to be simpler.



11.4 Estimating a Spectral Distribution 

In the preceding section it is seen that nonparametric estimation of a spectral density
requires smoothing and yields rates of convergence $n^{-\alpha}$ for values of $\alpha < 1/2$. In contrast,
a spectral distribution function can be estimated at the "usual" rate of convergence
$n^{-1/2}$ and natural estimators are asymptotically normally distributed. We assume $X_t$ is
a stationary time series with spectral density $f_X$.

The spectral distribution function $F_X(\lambda_0) = \int_{-\pi}^{\lambda_0} f_X(\lambda)\,d\lambda$ can be written in the form

$$\int_{-\pi}^{\pi} a(\lambda) f_X(\lambda)\,d\lambda$$

for $a$ the indicator function of the interval $(-\pi, \lambda_0]$. We shall consider estimation of a
general functional of this type by the estimator

$$I_n(a) := \int_{-\pi}^{\pi} a(\lambda) I_n(\lambda)\,d\lambda.$$

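11.11 Theorem. Suppose that $X_t = \sum_j \psi_j Z_{t-j}$ for an i.i.d. sequence $Z_t$ with finite
fourth cumulant $\kappa_4$ and constants $\psi_j$ such that $\sum_j |\psi_j| < \infty$. Moreover, assume that
$\sum_h |h|\gamma_X^2(h) < \infty$. Then, for any symmetric function $a$ such that $\int a^2(\lambda)\,d\lambda < \infty$,

$$\sqrt n\Bigl(I_n(a) - \int a f_X\,d\lambda\Bigr) \rightsquigarrow N\Bigl(0,\ \kappa_4\Bigl(\int a f_X\,d\lambda\Bigr)^2 + 4\pi \int a^2 f_X^2\,d\lambda\Bigr).$$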

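Proof. We can expand $a(\lambda)$ in its Fourier series $a(\lambda) = \sum_j a_j e^{-ij\lambda}$ (say). By Parseval's
identity

$$\int a f_X\,d\lambda = \sum_h \gamma_X(h)\, a_h.$$

Similarly, by (11.2) and Parseval's identity

$$I_n(a) = \int a I_n\,d\lambda = \sum_{|h|<n} \hat\gamma_n(h)\, a_h.$$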

First suppose that $a_h = 0$ for $|h| > m$ and some $m$. Then $\int a I_n\,d\lambda$ is a linear combination
of $(\hat\gamma_n(0), \ldots, \hat\gamma_n(m))$. By Theorem 5.7, as $n \to \infty$,

$$\sqrt n\Bigl(\sum_h \hat\gamma_n(h) a_h - \sum_h \gamma_X(h) a_h\Bigr) \rightsquigarrow \sum_h a_h Z_h,$$

where $(Z_{-m}, \ldots, Z_0, Z_1, \ldots, Z_m)$ is a mean zero normally distributed random vector such
that $(Z_0, \ldots, Z_m)$ has covariance matrix $V$ as in Theorem 5.7 and $Z_{-h} = Z_h$ for every
$h$. Thus $\sum_h a_h Z_h$ is normally distributed with mean zero and variance

$$\begin{aligned}
\sum_g\sum_h V_{g,h}\, a_g a_h &= \kappa_4\Bigl(\sum_g a_g \gamma_X(g)\Bigr)^2 + \sum_g\sum_h \Bigl(\sum_k \gamma_X(k+h)\gamma_X(k+g) + \sum_k \gamma_X(k+h)\gamma_X(k-g)\Bigr) a_g a_h \\
&= \kappa_4\Bigl(\int a f_X\,d\lambda\Bigr)^2 + 4\pi \int a^2 f_X^2\,d\lambda.
\end{aligned} \tag{11.7}$$

The last equality follows after a short calculation, using that $a_h = a_{-h}$. (Note that we
have used the expression for $V_{g,h}$ given in Theorem 5.7 also for negative $g$ or $h$, which is
correct, because both $\operatorname{cov}(Z_g, Z_h)$ and the expression in Theorem 5.7 remain the same if
$g$ or $h$ is replaced by $-g$ or $-h$.)

This concludes the proof in the case that $a_h = 0$ for $|h| > m$, for some $m$. The
general case is treated with the help of Lemma 3.10. Set $a_m = \sum_{|j|\le m} a_j e^{-i\lambda j}$ and apply
the preceding argument to $X_{n,m} := \sqrt n \int a_m (I_n - f_X)\,d\lambda$ to see that $X_{n,m} \rightsquigarrow N(0, \sigma_m^2)$
as $n \to \infty$, for every fixed $m$. The asymptotic variance $\sigma_m^2$ is the expression given in the
theorem with $a_m$ instead of $a$. If $m \to \infty$, then $\sigma_m^2$ converges to the expression in the
theorem, by the dominated convergence theorem, because $a$ is square-integrable and
$f_X$ is uniformly bounded. Therefore, by Lemma 3.10 it suffices to show that for every
$m_n \to \infty$

$$\sqrt n \int (a - a_{m_n})(I_n - f_X)\,d\lambda \xrightarrow{P} 0. \tag{11.8}$$

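Set $b = a - a_{m_n}$. The variance of the random variable in (11.8) is the same as the variance
of $\sqrt n \int (a - a_{m_n}) I_n\,d\lambda$, and can be computed as, in view of Parseval's identity,

$$\begin{aligned}
\operatorname{var}\Bigl(\sqrt n \sum_{|h|<n} \hat\gamma_n(h) b_h\Bigr) &= n \sum_{|g|<n}\sum_{|h|<n} \operatorname{cov}\bigl(\hat\gamma_n(g), \hat\gamma_n(h)\bigr)\, b_g b_h \\
&= \frac1n \sum_{|g|<n}\sum_{|h|<n}\sum_{s=1}^{n-|g|}\sum_{t=1}^{n-|h|} \operatorname{cov}(X_{s+g}X_s, X_{t+h}X_t)\, b_g b_h.
\end{aligned}$$

Using the same approach as in Section 5.2, we can rewrite this as

$$\frac1n \sum_{|g|<n}\sum_{|h|<n}\sum_{s=1}^{n-|g|}\sum_{t=1}^{n-|h|} \Bigl(\kappa_4\sigma^4 \sum_i \psi_{i+g}\psi_i\psi_{t-s+h+i}\psi_{t-s+i} + \gamma_X(t-s+h-g)\gamma_X(t-s) + \gamma_X(t-s-g)\gamma_X(t-s+h)\Bigr) b_g b_h.$$

The absolute value of this expression can be bounded above by

$$\begin{aligned}
&\sum_g\sum_h\sum_k \Bigl(\sum_i \bigl|\psi_{i+g}\psi_i\psi_{k+h+i}\psi_{k+i}\bigr|\,|\kappa_4|\sigma^4 + \bigl|\gamma_X(k+h-g)\gamma_X(k)\bigr| + \bigl|\gamma_X(k-g)\gamma_X(k+h)\bigr|\Bigr) |b_g b_h| \\
&\qquad = \frac{1}{4\pi^2}\Bigl(|\kappa_4|\Bigl(\int \bar b\,\bar f\,d\lambda\Bigr)^2 + 4\pi \int \bar b^2 \tilde f^2\,d\lambda\Bigr),
\end{aligned}$$

by the same calculation as in (11.7), where we define

$$\bar b(\lambda) = \sum_h |b_h|\, e^{-i\lambda h}, \qquad \bar f(\lambda) = \sum_h\sum_i |\psi_i\psi_{i+h}|\,\sigma^2 e^{-i\lambda h}, \qquad \tilde f(\lambda) = \sum_h |\gamma_X(h)|\, e^{-i\lambda h}.$$

Under our assumptions $\bar f$ and $\tilde f$ are bounded functions. It follows that
$\operatorname{var} \sqrt n \int b_n I_n\,d\lambda \to 0$ if $\int \bar b_n^2\,d\lambda \to 0$. This is true in particular for $b_n = a - a_{m_n}$.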

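Next the mean of the left side of (11.8) can be computed as

$$\begin{aligned}
\sqrt n\Bigl(\sum_{|h|<n} \mathrm{E}\hat\gamma_n(h)\, b_h - \int b f_X\,d\lambda\Bigr) &= \sqrt n\Bigl(\sum_{|h|<n} \frac{n-|h|}{n}\gamma_X(h) b_h - \sum_h \gamma_X(h) b_h\Bigr) \\
&= -\sum_{|h|<n} \frac{|h|}{\sqrt n}\gamma_X(h) b_h - \sqrt n \sum_{|h|\ge n} \gamma_X(h) b_h.
\end{aligned}$$

By the Cauchy-Schwarz inequality this is bounded in absolute value by the square root
of

$$\sum_h |b_h|^2 \sum_h \frac{(|h| \wedge n)^2}{n}\gamma_X^2(h) \le \sum_h |b_h|^2 \sum_h |h|\,\gamma_X^2(h).$$

Under our assumptions this converges to zero as $\int \bar b^2\,d\lambda \to 0$. ■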

The preceding theorem is restricted to symmetric functions $a$, but can easily be
extended to general functions, because by the symmetry of the spectral density

$$\int a(\lambda) f_X(\lambda)\,d\lambda = \int \frac{a(\lambda) + a(-\lambda)}{2}\, f_X(\lambda)\,d\lambda.$$

11.12 EXERCISE. Show that for a possibly nonsymmetric function $a$ the theorem is
valid, but with asymptotic variance

$$\kappa_4\Bigl(\int a f_X\,d\lambda\Bigr)^2 + 2\pi \int a^2 f_X^2\,d\lambda + 2\pi \int a(\lambda)a(-\lambda) f_X^2(\lambda)\,d\lambda.$$

11.13 Example. To obtain the limit distribution of the estimator for the spectral
distribution function at the point $\lambda_0 \in [0,\pi]$, we apply the theorem with the symmetric
function $a = \tfrac12\bigl(1_{(-\pi,\lambda_0]} + 1_{[-\lambda_0,\pi)}\bigr)$. The asymptotic variance is equal to
$\kappa_4 F_X(\lambda_0)^2 + 4\pi \int_{-\lambda_0}^{\lambda_0} f_X^2\,d\lambda + 2\pi \int_{\lambda_0}^{\pi} f_X^2\,d\lambda$ for $\lambda_0 \in [0,\pi]$. □

11.14 Example. The choice $a(\lambda) = \cos(h\lambda)$ yields the estimator (for $0 \le h < n$)

$$\int \cos(h\lambda) I_n(\lambda)\,d\lambda = \operatorname{Re}\int e^{ih\lambda} I_n(\lambda)\,d\lambda = \frac1n \sum_{t=1}^{n-h} X_{t+h}X_t$$

of the auto-covariance function $\gamma_X(h)$ in the case that $\mathrm{E}X_t = 0$. Thus the preceding
theorem contains Theorem 5.7 as a special case. The present theorem shows how the
asymptotic covariance of the sample auto-covariance function can be expressed in the
spectral density. □
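
The identity in Example 11.14 can be checked numerically by replacing the integral with a
Riemann sum over an equispaced grid on $(-\pi,\pi]$. The following sketch is only an
illustration with hypothetical function names; it uses the uncentered periodogram, in line
with the assumption $\mathrm{E}X_t = 0$. Because the integrand is a trigonometric polynomial of
degree below the grid size, the Riemann sum agrees with the integral up to rounding error.
The same routine with an indicator function for $a$ gives an approximation to the
estimator of the spectral distribution function.

```python
import cmath
import math


def periodogram(x, lam):
    """I_n(lam) = |sum_t x_t e^{-i t lam}|^2 / (2 pi n), without centering."""
    n = len(x)
    d = sum(xt * cmath.exp(-1j * lam * t) for t, xt in enumerate(x, start=1))
    return abs(d) ** 2 / (2 * math.pi * n)


def integrate_against_periodogram(x, a, grid_size=None):
    """Riemann sum for the integral of a(lam) * I_n(lam) over one full period."""
    m = grid_size or 4 * len(x)
    step = 2 * math.pi / m
    grid = [-math.pi + step * j for j in range(m)]
    return step * sum(a(lam) * periodogram(x, lam) for lam in grid)


def spectral_distribution(x, lam0, grid_size=None):
    """Approximate I_n(a) for a the indicator of (-pi, lam0]."""
    return integrate_against_periodogram(
        x, lambda lam: 1.0 if lam <= lam0 else 0.0, grid_size)


if __name__ == "__main__":
    series = [0.3, -1.2, 0.7, 2.0, -0.5, 0.1, -0.9, 1.4]
    n = len(series)
    for h in range(3):
        direct = sum(series[t + h] * series[t] for t in range(n - h)) / n
        via_periodogram = integrate_against_periodogram(
            series, lambda lam, h=h: math.cos(h * lam))
        print(h, direct, via_periodogram)
    print(spectral_distribution(series, 1.0))
```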

11.15 EXERCISE. Show that the sequence of bivariate random vectors
$\sqrt n\bigl(\int a(I_n - f_X)\,d\lambda, \int b(I_n - f_X)\,d\lambda\bigr)$ converges in distribution to a bivariate Gaussian vector $(G_a, G_b)$
with mean zero and $\mathrm{E}G_aG_b = \kappa_4 \int a f_X\,d\lambda \int b f_X\,d\lambda + 4\pi \int ab f_X^2\,d\lambda$.

11.16 EXERCISE. Plot the periodogram of a white noise series of length 200. Does this 
look like a plot of 200 independent exponential variables? 

11.17 EXERCISE. Estimate the spectral density of the simulated time series given in 
the file sda/Cursusdata/sim2 by a smoothed periodogram. Compare this to the estimate 
obtained assuming that sim2 is an AR(3) series. 

11.18 EXERCISE. Estimate the spectral density of the Wolfer sunspot numbers (the
object sunspots in Splus) by
(i) a smoothed periodogram;
(ii) the spectral density of an appropriate AR-model.
Note: the mean is nonzero.



12 

Maximum Likelihood 



The method of maximum likelihood is one of the unifying principles of statistics, and
applies equally well to models for replicated experiments as to time series models. Given
observations $X_1, \ldots, X_n$ with a joint probability density $(x_1, \ldots, x_n) \mapsto p_{n,\theta}(x_1, \ldots, x_n)$
that depends on a parameter $\theta$, the likelihood function is the function

$$\theta \mapsto p_{n,\theta}(X_1, \ldots, X_n).$$

The maximum likelihood estimator for $\theta$, if it exists, is the value of $\theta$ that maximizes the
likelihood function.

The likelihood function corresponding to i.i.d. observations $X_1, \ldots, X_n$ is the product
of the likelihood functions of the individual observations, which makes likelihood
inference relatively easy. For time series models the likelihood function may be a more
complicated function of the observations and the parameter, and this causes problems for
practical implementation of likelihood inference as well as for theoretical analysis.
However, in "most" situations the final results are not that different from the more familiar
i.i.d. case. In particular, maximum likelihood estimators are typically $\sqrt n$-consistent and
possess a normal limit distribution, with mean zero and covariance the inverse of a certain
"Fisher information matrix".

In this chapter† we study the maximum likelihood estimator, and some approximations.
We also consider the effect of misspecification: using the likelihood for a model that
does not contain the distribution of the data. Such misspecification of the model may be
unintended, but is sometimes the result of a conscious choice. For instance, the likelihood
under the assumption that $X_1, \ldots, X_n$ is part of a stationary Gaussian time series $X_t$
is popular for inference, even if one may not believe that the time series is Gaussian.
The corresponding maximum likelihood estimator is closely related to the least squares
estimators and turns out to perform well also for a wide range of non-Gaussian time
series, and is thus of special interest. Another example is to postulate that the
innovations in a GARCH model are Gaussian, even though we may not believe strongly in
this assumption. The resulting estimators again work well also for non-Gaussian
innovations. A misspecified likelihood is also referred to as a quasi likelihood and the resulting
estimators are quasi likelihood estimators.

† Chapter still under construction!



12.1 General Likelihood 

A convenient representation of a likelihood is obtained by repeated conditioning. To
lighten the notation we abuse it by writing $p(y\,|\,x)$ for a conditional density of
a variable $Y$ given that another variable $X$ takes the value $x$, and denote the marginal
density of $X$ by $p(x)$. Thus we write a likelihood as $\theta \mapsto p_\theta(x_1, \ldots, x_n)$, and this can be
decomposed as

$$\theta \mapsto p_\theta(x_1, \ldots, x_n) = p_\theta(x_1)\, p_\theta(x_2\,|\,x_1) \cdots p_\theta(x_n\,|\,x_{n-1}, \ldots, x_1). \tag{12.1}$$

Clearly we must select appropriate versions of the (conditional) densities, but we shall
not worry about technical details in this section.

The decomposition resembles the factorization of the likelihood of an i.i.d. sample
of observations, but an important difference is that the $n$ terms on the right may all be
of a different form. Even if the time series $X_t$ is strictly stationary, each further term
entails conditioning on a bigger past and hence is potentially of a different character
than the earlier terms. However, in many examples the "distant past" $(X_s\colon s \ll t)$ is
nearly independent of the present $X_t$ given the "near past" $(X_s\colon s < t, s \approx t)$. Then the
likelihood does not change much if the conditioning in each term is limited to a fixed
number of variables in the past, and the terms of the product will take almost a common
form. Alternatively, we may augment the conditioning in each term to include the full
"infinite past", yielding the "pseudo likelihood"

$$\theta \mapsto p_\theta(x_1\,|\,x_0, x_{-1}, \ldots)\, p_\theta(x_2\,|\,x_1, x_0, \ldots) \cdots p_\theta(x_n\,|\,x_{n-1}, x_{n-2}, \ldots). \tag{12.2}$$

If the time series $X_t$ is strictly stationary, then the $t$th term in this product is a fixed
measurable function, independent of $t$, applied to the vector $(x_t, x_{t-1}, \ldots)$. In particular,
the terms of the product form a strictly stationary time series, which will be ergodic if
the original time series $X_t$ is ergodic. This is almost as good as the i.i.d. terms obtained
in the case of an i.i.d. time series.

The pseudo likelihood (12.2) cannot be used in practice, because the "negative"
variables $X_0, X_{-1}, \ldots$ are not observed. However, the preceding discussion suggests that
the "pseudo maximum likelihood estimators", defined as the maximizers of the pseudo
likelihood, may behave the same as the true maximum likelihood estimators. Moreover,
if it is true that the past observations $X_0, X_{-1}, \ldots$ do not play an important role in
defining the pseudo likelihood, then we could also replace them by arbitrary values, for
instance zero, and hence obtain an observable criterion function.

12.1 Example (Markov time series). If the time series $X_t$ is Markov, then the
conditioning in each term $p_\theta(x_t\,|\,x_{t-1}, \ldots, x_1)$ or $p_\theta(x_t\,|\,x_{t-1}, x_{t-2}, \ldots)$ can be restricted to
a single variable, the variable $x_{t-1}$. In this case the
likelihood and the pseudo likelihood differ only in the first terms, which are $p_\theta(x_1)$ and
$p_\theta(x_1\,|\,x_0)$, respectively. This difference should have negligible effect if $n$ is large.

Similarly, if the time series is Markov of order $p$, i.e. $p(x_t\,|\,x_{t-1}, x_{t-2}, \ldots)$ depends
only on $x_t, x_{t-1}, \ldots, x_{t-p}$, then the two likelihoods differ only in $p$ terms. This should be
negligible if $n$ is large relative to $p$.

A causal auto-regressive time series defined relative to an i.i.d. white noise series is an
example of this situation. Maximum likelihood estimators for auto-regressive processes
are commonly defined by using the pseudo likelihood with $X_0, \ldots, X_{-p+1}$ set equal to
zero. Alternatively, one simply drops the first $p$ terms of the likelihood, and works with
the approximate likelihood

$$(\sigma, \phi_1, \ldots, \phi_p) \mapsto \prod_{t=p+1}^{n} p_{Z,\sigma}\bigl(X_t - \phi_1 X_{t-1} - \cdots - \phi_p X_{t-p}\bigr),$$

for $p_{Z,\sigma}$ the density of the innovations. This can also be considered a conditional
likelihood given the observations $X_1, \ldots, X_p$. The difference of this likelihood with the true
likelihood is precisely the marginal density of the vector $(X_1, \ldots, X_p)$, which is
complicated in general, but should have a noticeable effect on the maximum likelihood estimator
only if $p$ is large relative to $n$. □
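
To make the approximate (conditional) likelihood concrete, the sketch below evaluates its
logarithm under the additional assumption that the innovation density $p_{Z,\sigma}$ is the
$N(0,\sigma^2)$ density; the text leaves this density general. The function names are
hypothetical. For $p = 1$ the maximizer is available in closed form and coincides with
the least squares estimator.

```python
import math
import random


def ar_conditional_loglik(x, phi, sigma):
    """Log of prod_{t=p+1}^n p_{Z,sigma}(X_t - phi_1 X_{t-1} - ... - phi_p X_{t-p}).

    Assumes Gaussian innovations with mean zero and standard deviation sigma.
    """
    p = len(phi)
    total = 0.0
    for t in range(p, len(x)):  # the first p observations are only conditioned on
        resid = x[t] - sum(phi[j] * x[t - 1 - j] for j in range(p))
        total += -0.5 * math.log(2 * math.pi * sigma ** 2) - resid ** 2 / (2 * sigma ** 2)
    return total


def ar1_conditional_mle(x):
    """Closed-form maximizer for p = 1: least squares slope and residual scale."""
    num = sum(x[t] * x[t - 1] for t in range(1, len(x)))
    den = sum(x[t - 1] ** 2 for t in range(1, len(x)))
    phi = num / den
    rss = sum((x[t] - phi * x[t - 1]) ** 2 for t in range(1, len(x)))
    return phi, math.sqrt(rss / (len(x) - 1))


if __name__ == "__main__":
    rng = random.Random(7)
    series = [0.0]
    for _ in range(500):  # simulate X_t = 0.6 X_{t-1} + Z_t
        series.append(0.6 * series[-1] + rng.gauss(0.0, 1.0))
    phi_hat, sigma_hat = ar1_conditional_mle(series)
    print(phi_hat, sigma_hat, ar_conditional_loglik(series, [phi_hat], sigma_hat))
```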

12.2 Example (GARCH). A strictly stationary GARCH process $X_t$ relative to an
i.i.d. series $Z_t$ can be written as $X_t = \sigma_t Z_t$, for $\sigma_t^2 = \mathrm{E}(X_t^2\,|\,\mathcal{F}_{t-1})$ and $\mathcal{F}_t$ the filtration
generated by $X_t, X_{t-1}, \ldots$. From Theorem 8.10 it is known that the filtration $\mathcal{F}_t$ is
also the natural filtration of the process $Z_t$ and hence the variable $Z_t$ is independent
of $\sigma_t^2$, which is measurable relative to $\mathcal{F}_{t-1}$. It follows that the conditional distribution
of $X_t$ given $X_{t-1}, X_{t-2}, \ldots$ is obtained by first calculating $\sigma_t^2$ from $X_{t-1}, X_{t-2}, \ldots$ and
next multiplying $\sigma_t$ by an independent variable $Z_t$. If $p_Z$ is the marginal density of the
variables $Z_t$, then the pseudo likelihood (12.2) takes the form

$$\prod_{t=1}^{n} \frac{1}{\sigma_t}\, p_Z\Bigl(\frac{X_t}{\sigma_t}\Bigr).$$

The parameters $\alpha, \phi_1, \ldots, \phi_p, \theta_1, \ldots, \theta_q$ are hidden in the variables $\sigma_t$, through the
GARCH relation (8.1). This formula is not the true likelihood, because it depends on
$X_0, X_{-1}, \ldots$ through the $\sigma_t$.

For an ARCH($q$) process the conditional variances $\sigma_t^2$ depend only on the variables
$X_{t-1}, \ldots, X_{t-q}$, in the simple form

$$\sigma_t^2 = \alpha + \theta_1 X_{t-1}^2 + \cdots + \theta_q X_{t-q}^2.$$

In this case the true likelihood and the pseudo likelihood differ only in the first $q$ of the
$n$ terms. This difference should be negligible. For practical purposes we could either drop
those first $q$ terms, giving a conditional likelihood, or act as if the unobserved variables
$X_0^2, \ldots, X_{-q+1}^2$ are zero.

For general GARCH processes the difference between the likelihoods is more
substantial. However, the dependence of $\sigma_t^2$ on lagged variables $X_s^2$ decreases exponentially
fast as the lag $t - s \to \infty$, at least in the case of a second order stationary GARCH series. That the
variables $X_s^2$ with $s \le 0$ do not play an important role in the definition of the (pseudo)
likelihood is also suggested by Theorem 8.14, which shows that a GARCH series defined
from arbitrary starting values converges to (strict) stationarity as $t$ grows to infinity
(provided that a strictly stationary GARCH process exists). Thus we might again use
the pseudo likelihood with the unobserved variables $X_s^2$ replaced by zero.

A practical implementation is to define $\sigma_0^2, \ldots, \sigma_{-p+1}^2$ and $X_0^2, \ldots, X_{-q+1}^2$ to be
zero, and next compute $\sigma_1^2, \sigma_2^2, \ldots$ recursively, using the GARCH relation (8.1) and the
observed values $X_1, \ldots, X_n$. Note that these zero starting values cannot be the true
values of the series if the series is strictly stationary. □
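
The recursive scheme with zero starting values can be sketched as follows. Two
assumptions are made that are not fixed by this section: the GARCH relation (8.1) is
taken to be $\sigma_t^2 = \alpha + \phi_1\sigma_{t-1}^2 + \cdots + \phi_p\sigma_{t-p}^2 + \theta_1X_{t-1}^2 + \cdots + \theta_qX_{t-q}^2$,
and the density $p_Z$ is taken to be standard normal, which gives a Gaussian quasi
likelihood. The function names are hypothetical.

```python
import math


def garch_variances(x, alpha, phi, theta):
    """Compute sigma_1^2, ..., sigma_n^2 recursively from zero starting values.

    Assumed recursion (hypothetical form of relation (8.1)):
    sigma_t^2 = alpha + sum_i phi_i * sigma_{t-i}^2 + sum_j theta_j * X_{t-j}^2.
    """
    sig2 = []
    for t in range(len(x)):
        value = alpha
        for i, coef in enumerate(phi, start=1):
            if t - i >= 0:  # sigma^2 at times <= 0 is set to zero
                value += coef * sig2[t - i]
        for j, coef in enumerate(theta, start=1):
            if t - j >= 0:  # X^2 at times <= 0 is set to zero
                value += coef * x[t - j] ** 2
        sig2.append(value)
    return sig2


def gaussian_quasi_loglik(x, alpha, phi, theta):
    """Log of prod_t sigma_t^{-1} p_Z(X_t / sigma_t) with p_Z standard normal."""
    total = 0.0
    for xt, s2 in zip(x, garch_variances(x, alpha, phi, theta)):
        total += -0.5 * math.log(2 * math.pi * s2) - xt ** 2 / (2 * s2)
    return total


if __name__ == "__main__":
    observations = [0.4, -1.1, 0.3, 2.2, -0.7, 0.1]
    print(garch_variances(observations, 0.5, [0.3], [0.2]))
    print(gaussian_quasi_loglik(observations, 0.5, [0.3], [0.2]))
```

Maximizing `gaussian_quasi_loglik` over $(\alpha, \phi, \theta)$ with a numerical optimizer would
give the corresponding quasi maximum likelihood estimator; $\alpha > 0$ is needed to keep
the computed variances positive.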

To gain insight into the asymptotic properties of the maximum likelihood estimators,
we adopt the working hypothesis that these have the same asymptotic properties as the
maximum pseudo likelihood estimators. Furthermore, we assume that the time series
$X_t$ is strictly stationary and ergodic. These conditions are certainly too stringent, but
they simplify the arguments. The conclusions typically apply to any time series that
"approaches stationarity" as $t \to \infty$ and for which averages converge to constants.

Abbreviate $x_t, x_{t-1}, \ldots$ to $\vec x_t$. The maximum pseudo likelihood estimator maximizes
the function

$$\theta \mapsto M_n(\theta) = \frac1n \sum_{t=1}^{n} \log p_\theta(X_t\,|\,\vec X_{t-1}). \tag{12.3}$$

If the variables $\log p_\theta(X_t\,|\,\vec X_{t-1})$ are integrable, as we assume, then, by the ergodic
theorem, Theorem 4.18, the averages $M_n(\theta)$ converge to their expectation

$$M(\theta) = \mathrm{E}_{\theta_0} \log p_\theta(X_1\,|\,\vec X_0).$$

The expectation is taken under the "true" parameter $\theta_0$ governing the distribution of
the time series $X_t$. The difference of the expected values $M(\theta_0)$ and $M(\theta)$ can also be
written as

$$M(\theta_0) - M(\theta) = \mathrm{E}_{\theta_0} \int \Bigl(\log \frac{p_{\theta_0}(x_1\,|\,\vec X_0)}{p_\theta(x_1\,|\,\vec X_0)}\Bigr)\, p_{\theta_0}(x_1\,|\,\vec X_0)\,d\mu(x_1).$$

The integral inside the expectation is the Kullback-Leibler divergence between the
(conditional) measures with densities $p_\theta(\cdot\,|\,\vec x_0)$ and $p_{\theta_0}(\cdot\,|\,\vec x_0)$. It is well known that the
Kullback-Leibler divergence between two probability measures is nonnegative and is zero if and
only if the two measures are the same. Thus $M(\theta) \le M(\theta_0)$ for every $\theta$ with equality if
and only if the two conditional measures are the same. Under the reasonable
assumption that each value of $\theta$ indexes a different underlying distribution of the time series
$X_t$ ("identifiability of $\theta$"), we conclude that the map $\theta \mapsto M(\theta)$ possesses an absolute
maximum at $\theta = \theta_0$.

The convergence of the criterion function $M_n$ to $M$, and the definitions of $\hat\theta_n$ and
$\theta_0$ as the points of maxima of these functions suggest that $\hat\theta_n$ converges to $\theta_0$. In other
words, we expect the maximum likelihood estimators to be consistent for the "true"
value $\theta_0$. The argument can be made mathematically rigorous, for instance by imposing
additional conditions that guarantee the uniform convergence of $M_n$ to $M$. See ?

To establish the asymptotic normality of $\hat\theta_n$, we assume that the gradient $\dot M_n(\theta)$
and second derivative matrix $\ddot M_n(\theta)$ of the map $\theta \mapsto M_n(\theta)$ exist and are continuous.
Because $\hat\theta_n$ is a point of maximum of $M_n$, it satisfies the stationary equation $\dot M_n(\hat\theta_n) = 0$.
By Taylor's theorem there exists a point $\tilde\theta_n$ on the line segment between $\theta_0$ and $\hat\theta_n$ such that

$$0 = \dot M_n(\hat\theta_n) = \dot M_n(\theta_0) + \ddot M_n(\tilde\theta_n)(\hat\theta_n - \theta_0).$$

By simple algebra this can be rewritten as

$$\sqrt n(\hat\theta_n - \theta_0) = -\bigl(\ddot M_n(\tilde\theta_n)\bigr)^{-1} \sqrt n\,\dot M_n(\theta_0).$$

Because $\hat\theta_n \xrightarrow{P} \theta_0$ if $\theta_0$ is the true parameter, it is a reasonable assumption that the
matrices $\ddot M_n(\tilde\theta_n)$ and $\ddot M_n(\theta_0)$ possess the same limit, the convergence of the second
one, which is an average, being guaranteed by the ergodic theorem. If we can also show
that the sequence $\sqrt n\,\dot M_n(\theta_0)$ converges in distribution, then we can conclude that the
sequence $\sqrt n(\hat\theta_n - \theta_0)$ converges in distribution, by Slutsky's lemma.

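The convergence of the sequence $\sqrt n\,\dot M_n(\theta_0)$ can be established by the martingale
central limit theorem, Theorem 4.28. To see this, first differentiate the identity
$\int p_\theta(x_1\,|\,\vec x_0)\,d\mu(x_1) = 1$ twice to verify that

$$\int \dot p_\theta(x_1\,|\,\vec x_0)\,d\mu(x_1) = \int \ddot p_\theta(x_1\,|\,\vec x_0)\,d\mu(x_1) = 0.$$

The function $\ell_\theta(x_t\,|\,\vec x_{t-1}) = \log p_\theta(x_t\,|\,\vec x_{t-1})$ possesses partial derivatives relative to $\theta$
given by $\dot\ell_\theta = \dot p_\theta/p_\theta$ and $\ddot\ell_\theta = \ddot p_\theta/p_\theta - \dot\ell_\theta\dot\ell_\theta^T$. Combination with the preceding display
yields conditional versions of the usual identities "expectation of score function is zero"
and "expectation of observed information is minus the Fisher information", showing that

$$\mathrm{E}_\theta\bigl(\dot\ell_\theta(X_1\,|\,\vec X_0)\,\big|\,\vec X_0\bigr) = 0,$$

$$\operatorname{Cov}_\theta\bigl(\dot\ell_\theta(X_1\,|\,\vec X_0)\,\big|\,\vec X_0\bigr) = \mathrm{E}_\theta\bigl(\dot\ell_\theta\dot\ell_\theta^T(X_1\,|\,\vec X_0)\,\big|\,\vec X_0\bigr) = -\mathrm{E}_\theta\bigl(\ddot\ell_\theta(X_1\,|\,\vec X_0)\,\big|\,\vec X_0\bigr).$$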

The first identity show that the sequence y/nM n (6) = n -1 / 2 Ylt=i ^e(X t \ Xt-i) is a mar- 
tingale under the true measure specified by the parameter 8. Under reasonable conditions 
the martingale central limit theorem yields that the sequence ^/nM n (6) is asymptotically 
normal with mean zero and covariance matrix 

Ie=Cov e (ie(X 1 \X )). 

By the second identity EgM n (6) = —Ig and hence M n (6) E^. — Ig, by the ergodic 
theorem. Combining this with Slutsky's lemma as indicated before, we find that, under 
true parameter 6 , 

MOn-e )^N(o,i 6a 1 ). 
The matrix $I_\theta$ is known as the Fisher information matrix. Typically, it can also be found
as the limits

$$I_{n,\theta} := \frac1n E_\theta \frac{\partial}{\partial\theta}\log p_\theta(X_1,\ldots,X_n)\,\frac{\partial}{\partial\theta}\log p_\theta(X_1,\ldots,X_n)^T \to I_\theta,$$

$$-\frac1n \frac{\partial^2}{\partial\theta^2}\log p_\theta(X_1,\ldots,X_n)\Big|_{\theta=\hat\theta_n} \to I_\theta.$$

The first connects it to the definition of the Fisher information for arbitrary observations.
The expression on the left in the second line is minus the second derivative matrix of the
log likelihood surface at the maximum likelihood estimator, normalized by $n$, and is known
as the observed information. It gives an estimate for the inverse of the asymptotic covariance
matrix of the sequence $\sqrt n(\hat\theta_n - \theta)$. Thus a large observed information indicates that the
maximum likelihood estimator has small asymptotic covariance and hence is (asymptotically)
efficient.
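
As a rough numerical illustration of the observed information, the following Python sketch
approximates minus the normalized second derivative of a log likelihood at a given point by a
central finite difference, for a scalar parameter. The function name `observed_information`,
the step size `h` and the toy quadratic log likelihood are assumptions made for this
illustration only; they do not come from these notes.

```python
def observed_information(loglik, theta_hat, n, h=1e-3):
    """Approximate -(1/n) d^2/dtheta^2 loglik(theta) at theta_hat (scalar theta).

    `loglik` maps a scalar parameter to the log likelihood of all n observations.
    A central second difference is used; `h` is an assumed step size.
    """
    second = (loglik(theta_hat + h) - 2.0 * loglik(theta_hat)
              + loglik(theta_hat - h)) / h ** 2
    return -second / n


# Toy check (hypothetical): a log likelihood that is exactly quadratic around 1.0
# with curvature -3n, so that the observed information equals 3.
n = 100
toy_loglik = lambda theta: -0.5 * n * 3.0 * (theta - 1.0) ** 2
info = observed_information(toy_loglik, 1.0, n)
# Estimate of the asymptotic variance of sqrt(n) * (theta_hat - theta).
approx_variance = 1.0 / info
```

In practice the log likelihood would be that of the fitted time series model, and the step
size would have to be tuned to the scale of the parameter.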

The efficiency of the maximum likelihood estimator can be understood in an absolute
sense by relating it to the Cramér-Rao bound for the variance of unbiased estimators.
According to the Cramér-Rao theorem, the covariance matrix of any unbiased estimator
$T_n$ of $\theta$ satisfies

$$\mathrm{Cov}_\theta\bigl(\sqrt n(T_n - \theta)\bigr) \ge I_{n,\theta}^{-1}.$$

The preceding informal derivation suggests that the asymptotic covariance matrix of the
sequence $\sqrt n(\hat\theta_n - \theta)$ is equal to $I_\theta^{-1}$. We interpret this as saying that the maximum
likelihood estimator is asymptotically of minimal variance, or asymptotically efficient.

It is possible to give a rigorous proof of the asymptotic normality of the maximum
likelihood estimator, and also of a precise formulation of its asymptotic efficiency. See ?

12.3 EXERCISE. Compute the maximum likelihood estimators for $(\theta,\sigma^2)$ in a stationary,
causal AR(1) model $X_t = \theta X_{t-1} + Z_t$ with Gaussian innovations $Z_t$. What is their
limit distribution? Calculate the Fisher information matrix $I_\theta$.

12.4 EXERCISE. Find the pair of likelihood equations $\dot M_n(\alpha,\theta) = 0$ for estimating the
parameters $(\alpha,\theta)$ in an ARCH(1) model. Verify the martingale property of $n\dot M_n(\alpha,\theta)$.



12.2 Misspecification 

Specification of a correct statistical model for a given time series is generally difficult, and 
it is typically hard to decide which of two given reasonable models is the better one. This 
observation is often taken as motivation for modelling a time series as a Gaussian series, 
Gaussianity being considered as good as any other specification and Gaussian likelihoods 
being relatively easy to handle. Meanwhile the validity of the Gaussian assumption is 
not really believed. It is therefore important, in time series analysis even more than in 
statistics for replicated experiments, to consider the behaviour of estimation procedures
under misspecification of a model. 

Thus consider an estimator $\hat\theta_n$ defined as the point of maximum of a likelihood
function of a model that possibly does not contain the true density of the observations.
It is again easier to consider the pseudo likelihood (12.2) than the true likelihood. The
misspecified maximum pseudo likelihood estimator is still the point of maximum of the
map $\theta \mapsto M_n(\theta)$ defined in (12.3). For the asymptotic analysis of $\hat\theta_n$ we again apply the
ergodic theorem to see that $M_n(\theta) \to M(\theta)$ almost surely, for $M(\theta)$ the expectation of
$M_n(\theta)$, defined by

$$M(\theta) = E\,\ell_\theta(X_1|X_0).$$

The difference with the foregoing is that presently the expectation is taken under the
true model for the series $X_t$, which may or may not be representable through one of
the parameters $\theta$. However, the same reasoning suggests that $\hat\theta_n$ converges in probability
to a value $\theta_0$ that maximizes the map $\theta \mapsto M(\theta)$. Without further specification of the
model and the true distribution of the time series, there is little more we can say about
this maximizing value than that it gives conditional densities $p_{\theta_0}(\cdot|x_0)$ that are, on the
average, closest to the true conditional densities $p(\cdot|x_0)$ of the time series in terms of the
Kullback-Leibler divergence.

Having ascertained that the sequence $\hat\theta_n$ ought to converge to a limit, most of the
subsequent arguments to establish asymptotic normality of the sequence $\sqrt n(\hat\theta_n - \theta_0)$ go
through, also under misspecification, provided that

$$E\bigl(\dot\ell_{\theta_0}(X_1|X_0)\,|\,X_0\bigr) = 0, \quad \text{a.s.} \tag{12.4}$$

In that case the sequence $\sqrt n\,\dot M_n(\theta_0)$ is still a normalized martingale, and may be expected
to be asymptotically normal by the martingale central limit theorem. By the assumed
ergodicity of the series $X_t$ the sequence $\ddot M_n(\theta_0)$ will still converge to a fixed matrix, and the
same may be expected to be true for the sequence of second derivatives $\ddot M_n(\tilde\theta_n)$ evaluated
at a point between $\hat\theta_n$ and $\theta_0$. A difference is that the asymptotic covariance matrix $\Sigma_{\theta_0}$
of the sequence $\sqrt n\,\dot M_n(\theta_0)$ and the limit $R_{\theta_0}$ of the sequence $\ddot M_n(\theta_0)$ may no longer be
each other's negatives. The conclusion will therefore take the more complicated form

$$\sqrt n(\hat\theta_n - \theta_0) \rightsquigarrow N\bigl(0, R_{\theta_0}^{-1}\Sigma_{\theta_0}(R_{\theta_0}^{-1})^T\bigr).$$

The asymptotic variance $R_{\theta_0}^{-1}\Sigma_{\theta_0}(R_{\theta_0}^{-1})^T$ is known as the sandwich formula. Thus the
sequence $\hat\theta_n$ will converge rapidly to a limit $\theta_0$, and "fitting the wrong model" will be
useful as long as the density $p_{\theta_0}$ is sufficiently close to the true distribution of the time
series.
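
To make the sandwich formula concrete, here is a minimal Python sketch for a scalar
parameter. It assumes that the derivatives of the log conditional densities at the estimate
are already available as two lists; the function name `sandwich_variance` and the numbers in
the example are hypothetical. Estimating $\Sigma_{\theta_0}$ by the average squared score relies on
the martingale condition (12.4).

```python
def sandwich_variance(scores, hessians):
    """Scalar-parameter sandwich estimate R^{-1} * Sigma * R^{-1}.

    scores[t]   : first derivative of the log conditional density at theta_hat
    hessians[t] : second derivative of the log conditional density at theta_hat
    Sigma is estimated by the average squared score (reasonable under the
    martingale condition (12.4)), R by the average second derivative.
    """
    n = len(scores)
    sigma_hat = sum(s * s for s in scores) / n
    r_hat = sum(hessians) / n
    return sigma_hat / (r_hat * r_hat)


# Hypothetical numbers: Sigma = 2.5 and R = -2, so Sigma != -R and the
# sandwich value 2.5 / 4 differs from the naive value -1 / R.
scores = [1.0, -1.0, 2.0, -2.0]
hessians = [-2.0, -2.0, -2.0, -2.0]
var_sandwich = sandwich_variance(scores, hessians)
var_naive = -1.0 / (sum(hessians) / len(hessians))
```

When the model is correctly specified the two values should be close for large $n$; a clear
discrepancy is one informal indication of misspecification.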

Condition (12.4) is odd, and not always satisfied. It is certainly satisfied if the point
of maximum $\theta_0$ of the map $\theta \mapsto M(\theta)$ is such that for every $x_0$ it is also a point of
maximum of the map, with $p$ the true conditional density of the time series,

$$\theta \mapsto \int \log p_\theta(x_1|x_0)\,p(x_1|x_0)\,d\mu(x_1).$$

This is not necessarily the case, as the points of maximum of the functions in the display
may be different for different values of $x_0$. The point $\theta_0$ is by definition the point of
maximum of the average of these functions over $x_0$. Failure of (12.4) does not mean that
the sequence $\sqrt n(\hat\theta_n - \theta_0)$ is not asymptotically normally distributed, but it does mean
that we cannot apply the martingale central limit theorem, as in the preceding argument.

In the next section we discuss a major example of possible misspecification: estimating
a parameter by Gaussian maximum likelihood. The following example concerns
GARCH processes, and illustrates that some misspecifications are harmless, whereas
others may cause trouble.

12.5 Example (GARCH). As found in Example 12.2, the pseudo likelihood for a
GARCH(p, q) process takes the form

$$\theta \mapsto \prod_{t=1}^n \frac{1}{\sigma_t(\theta)}\,p_Z\Bigl(\frac{X_t}{\sigma_t(\theta)}\Bigr),$$

where $p_Z$ is the density of the innovations. In Chapter 8 it was noted that a $t$-density
$p_Z$ may be appropriate to explain the observed leptokurtic tails of financial time series.
However, the Gaussian density $p_Z(z) = \exp(-\tfrac12 z^2)/\sqrt{2\pi}$ is popular for likelihood based
inference for GARCH processes. The corresponding log pseudo likelihood is up to additive
and multiplicative constants equal to

$$\theta \mapsto -\frac1n\sum_{t=1}^n \log\sigma_t^2(\theta) - \frac1n\sum_{t=1}^n \frac{X_t^2}{\sigma_t^2(\theta)}.$$

The expectation of this criterion function can be written as

$$M(\theta) = -E\Bigl(\log\sigma_1^2(\theta) + \frac{E(X_1^2|\mathcal F_0)}{\sigma_1^2(\theta)}\Bigr).$$

Both expectations on the right side are taken relative to the true distribution of the time
series, and hence involve an unknown density $p_Z$ and the parameter $\theta_0$. Suppose that the
GARCH equation (8.1) for the conditional variances is correctly specified, even though
the true density of the innovations may not be standard normal. Then
$E(X_1^2|\mathcal F_0) = \sigma_1^2(\theta_0)$ for the true parameter $\theta_0$ and hence

$$M(\theta) = -E\Bigl(\log\sigma_1^2(\theta) + \frac{\sigma_1^2(\theta_0)}{\sigma_1^2(\theta)}\Bigr).$$

Even though the expectation on the right involves the unspecified density $p_Z$, the map
$\theta \mapsto M(\theta)$ is maximized at $\theta = \theta_0$ no matter the nature of the density $p_Z$. This follows
from the fact that, for every fixed $\sigma_0^2$, the map $\sigma^2 \mapsto \log\sigma^2 + \sigma_0^2/\sigma^2$ assumes its minimal
value on the domain $(0,\infty)$ at $\sigma^2 = \sigma_0^2$.

We conclude that the use of the Gaussian density for $p_Z$ will lead to consistent
estimators $\hat\theta_n$ for the coefficients of the GARCH equation, for any density $p_Z$. The preceding
argument shows that this pleasant fact results from the fact that the likelihood based
on choosing the normal density for $p_Z$ depends on the observations $X_t$ only through a
linear function of the squares $X_t^2$. For another choice of density $p_Z$, such as a $t$-density,
this is not true, and we cannot hope to be guarded against misspecification in that case.

The maximizing parameter $\theta_0$ in this model can be seen to yield a conditional density
$p_{\theta_0}(\cdot|x_0,x_{-1},\ldots)$ that is closest to the true conditional density $p(\cdot|x_0,x_{-1},\ldots)$ relative
to the Kullback-Leibler divergence, for any given values $x_0,x_{-1},\ldots$. This implies that
equation (12.4) is satisfied in this case, and hence we expect the sequence $\sqrt n(\hat\theta_n - \theta_0)$
to be asymptotically normal, with asymptotic variance given by the sandwich formula.
□
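
The robustness claimed in this example can be explored numerically. The Python sketch below
evaluates the Gaussian pseudo log likelihood criterion for a GARCH(1,1) series whose
innovations are uniform rather than normal. It assumes the recursion
$\sigma_t^2 = \alpha + \phi\sigma_{t-1}^2 + \theta X_{t-1}^2$, started at the stationary variance; the
function names, the starting value, the seed and the parameter values are all choices made
for this illustration.

```python
import math
import random


def garch11_criterion(x, alpha, phi, theta):
    """Gaussian pseudo log likelihood -(1/n) sum(log s2_t + x_t^2 / s2_t).

    Assumed recursion: s2_t = alpha + phi * s2_{t-1} + theta * x_{t-1}^2,
    started (an assumption) at the stationary variance alpha / (1 - phi - theta).
    """
    s2 = alpha / (1.0 - phi - theta)
    total = 0.0
    for xt in x:
        total += math.log(s2) + xt * xt / s2
        s2 = alpha + phi * s2 + theta * xt * xt
    return -total / len(x)


def simulate_garch11(n, alpha, phi, theta, rng):
    """Simulate GARCH(1,1) with non-Gaussian (uniform, unit variance) innovations."""
    s2 = alpha / (1.0 - phi - theta)
    x = []
    for _ in range(n):
        z = rng.uniform(-math.sqrt(3.0), math.sqrt(3.0))  # mean 0, variance 1
        xt = math.sqrt(s2) * z
        x.append(xt)
        s2 = alpha + phi * s2 + theta * xt * xt
    return x


rng = random.Random(0)
x = simulate_garch11(5000, 0.1, 0.8, 0.1, rng)
at_truth = garch11_criterion(x, 0.1, 0.8, 0.1)
at_wrong = garch11_criterion(x, 1.0, 0.8, 0.1)  # badly wrong intercept
```

For a long series the criterion at the true parameter should exceed its value at a clearly
wrong parameter, even though the innovations are not Gaussian. A numerical optimizer over
$(\alpha,\phi,\theta)$ would turn this criterion into an estimator.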



12.3 Gaussian Likelihood 

A Gaussian time series is a time series $X_t$ such that every finite subvector
$(X_{t_1},\ldots,X_{t_n})$ of the series possesses a multivariate normal distribution. In
particular, the vector $\vec X_n = (X_1,\ldots,X_n)$ is multivariate normally distributed, and hence its
distribution is completely specified by a mean vector $\mu_n \in \mathbb R^n$ and an $(n\times n)$ covariance
matrix $\Gamma_n$. If the time series $X_t$ is covariance stationary, then the matrix $\Gamma_n$ has entries
$(\Gamma_n)_{s,t} = \gamma_X(s-t)$, for $\gamma_X$ the auto-covariance function of $X_t$. We assume that both
the mean $\mu_n$ and the covariance function $\gamma_X$ can be expressed in a parameter $\theta$ of fixed
dimension, so that we can write $\mu_n = \mu_n(\theta)$ and $\Gamma_n = \Gamma_n(\theta)$.

The likelihood function under the assumption that $X_t$ is a Gaussian time series is
the multivariate normal density viewed as a function of the parameter and takes the form

$$\theta \mapsto \frac{1}{(2\pi)^{n/2}\sqrt{\det\Gamma_n(\theta)}}\,e^{-\frac12(\vec X_n-\mu_n(\theta))^T\Gamma_n(\theta)^{-1}(\vec X_n-\mu_n(\theta))}. \tag{12.5}$$

We refer to this function as the Gaussian likelihood, and to its point of maximum $\hat\theta_n$,
if it exists, as the maximum Gaussian likelihood estimator. The Gaussian likelihood and
the corresponding estimator are commonly used, also in the case that the time series $X_t$
is non-Gaussian.

Maximum Gaussian likelihood is closely related to the method of least squares,
described in Section 10.3. We can see this using the likelihood factorization (12.1).
For a Gaussian process the conditional densities $p_\theta(x_t|x_{t-1},\ldots,x_1)$ are univariate
normal densities with means $E_\theta(X_t|X_{t-1},\ldots,X_1)$ and variances $v_{t-1}(\theta)$ equal to
the prediction errors. (Cf. Exercise 12.6.) Furthermore, the best nonlinear predictor
$E_\theta(X_t|X_{t-1},\ldots,X_1)$ is automatically a linear combination of the predicting variables
and hence coincides with the best linear predictor $\Pi_{t-1}X_t(\theta)$. This shows that the
factorization (12.1) reduces to

$$\prod_{t=1}^n \frac{1}{\sqrt{2\pi v_{t-1}(\theta)}}\exp\Bigl(-\frac{(X_t-\Pi_{t-1}X_t(\theta))^2}{2v_{t-1}(\theta)}\Bigr).$$

Maximizing this relative to $\theta$ is equivalent to maximizing its logarithm, which can be
written in the form

$$\theta \mapsto -\frac n2\log(2\pi) - \frac12\sum_{t=1}^n \log v_{t-1}(\theta) - \frac12\sum_{t=1}^n \frac{(X_t-\Pi_{t-1}X_t(\theta))^2}{v_{t-1}(\theta)}. \tag{12.6}$$

This function differs in form from the least squares criterion function (10.2) only in the
presence of the function $\theta \mapsto -\frac12\sum_{t=1}^n \log v_{t-1}(\theta)$. In situations where this function is
almost constant, least squares and Gaussian maximum likelihood estimators are almost
the same.
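
The prediction-error form (12.6) of the Gaussian log likelihood can be evaluated without
inverting the covariance matrix. The Python sketch below does this for a mean-zero stationary
series with a given auto-covariance function, using the standard Durbin-Levinson recursion to
obtain the best linear predictors and the prediction errors. The recursion is swapped in here
for illustration; the function name `gaussian_loglik` and the example auto-covariance are
assumptions.

```python
import math


def gaussian_loglik(x, gamma):
    """Gaussian log likelihood of a mean-zero stationary series in the form (12.6).

    `gamma(h)` is the auto-covariance function. One-step predictors and prediction
    errors v_{t-1} are computed with the Durbin-Levinson recursion.
    """
    v = gamma(0)   # v_0: variance of X_1
    phi = []       # coefficients of the best linear predictor of the next value
    loglik = 0.0
    for t in range(len(x)):
        # Predict x[t] from x[t-1], x[t-2], ..., x[0].
        pred = sum(phi[j] * x[t - 1 - j] for j in range(len(phi)))
        loglik += (-0.5 * math.log(2.0 * math.pi) - 0.5 * math.log(v)
                   - 0.5 * (x[t] - pred) ** 2 / v)
        # Durbin-Levinson update from order m - 1 to order m.
        m = len(phi) + 1
        k = (gamma(m) - sum(phi[j] * gamma(m - 1 - j) for j in range(m - 1))) / v
        phi = [phi[j] - k * phi[m - 2 - j] for j in range(m - 1)] + [k]
        v = v * (1.0 - k * k)
    return loglik


# Assumed example: auto-covariance 0.5^|h| (an AR(1) series with unit variance).
gamma = lambda h: 0.5 ** abs(h)
x = [0.3, -1.2]
value = gaussian_loglik(x, gamma)

# Direct evaluation of the bivariate normal log density for comparison.
rho = gamma(1)
direct = (-math.log(2.0 * math.pi) - 0.5 * math.log(1.0 - rho ** 2)
          - 0.5 * (x[0] ** 2 - 2.0 * rho * x[0] * x[1] + x[1] ** 2) / (1.0 - rho ** 2))
```

For two observations the recursive value agrees with the direct bivariate formula, which is a
small instance of the equivalence of (12.5) and (12.6).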

12.6 EXERCISE. Suppose that the vector $(X_1,\ldots,X_t)$ possesses a multivariate normal
distribution. Show that the conditional distribution of $X_t$ given $(X_1,\ldots,X_{t-1})$ is normal
with mean $E(X_t|X_1,\ldots,X_{t-1})$ and variance $E\bigl(X_t - E(X_t|X_1,\ldots,X_{t-1})\bigr)^2$. [Write
$X_t = (X_t - Y) + Y$ for $Y = E(X_t|X_1,\ldots,X_{t-1})$, prove that $Y$ is a linear function of
$X_1,\ldots,X_{t-1}$, and conclude that the vector $(X_t - Y, Y)$ is bivariate normal with
correlation zero. Conclude that $X_t - Y$ and $Y$ are independent.]

12.7 Example (Auto regression). For causal stationary auto-regressive processes of
order $p$ the innovations $X_t - \Pi_{t-1}X_t$ are equal to the noise input $Z_t$ for $t > p$, and
the prediction errors $v_t$ are equal to $\sigma^2 = EZ_{t+1}^2$ for $t > p$. Thus the function
$\theta \mapsto -\frac12\sum_{t=1}^n \log v_{t-1}(\theta)$ in the formula for the Gaussian likelihood is approximately equal
to $-\frac12 n\log\sigma^2$, if $n$ is much bigger than $p$. The log Gaussian likelihood is approximately
equal to

$$-\frac n2\log(2\pi) - \frac12 n\log\sigma^2 - \frac12\sum_{t=p+1}^n \frac{(X_t - \phi_1X_{t-1} - \cdots - \phi_pX_{t-p})^2}{\sigma^2}.$$

For a fixed $\sigma^2$ maximization relative to $\phi_1,\ldots,\phi_p$ is equivalent to minimization of the
sum of squares and hence gives identical results as the method of least squares discussed
in Sections 10.1 and 10.3. Maximization relative to $\sigma^2$ gives (almost) the Yule-Walker
estimator discussed in Section 10.1. □

12.8 Example (ARMA). In ARMA models the parameter $\sigma^2$ enters as a multiplicative
factor in the covariance function (cf. Section 10.3). This implies that the log Gaussian
likelihood function can be written in the form, with $\theta = (\phi_1,\ldots,\phi_p,\theta_1,\ldots,\theta_q)$,

$$-\frac n2\log(2\pi) - \frac12 n\log\sigma^2 - \frac12\sum_{t=1}^n \log v_{t-1}(\theta) - \frac12\sum_{t=1}^n \frac{(X_t-\Pi_{t-1}X_t(\theta))^2}{\sigma^2v_{t-1}(\theta)},$$

where $v_{t-1}(\theta)$ now denotes the prediction error for unit noise variance.
Differentiating this with respect to $\sigma^2$ we see that for every fixed $\theta$, the Gaussian
likelihood is maximized relative to $\sigma^2$ by

$$\hat\sigma^2(\theta) = \frac1n\sum_{t=1}^n \frac{(X_t-\Pi_{t-1}X_t(\theta))^2}{v_{t-1}(\theta)}.$$

Substituting this expression in the log Gaussian likelihood, we see that the maximum
Gaussian likelihood estimator of $\theta$ maximizes the function

$$\theta \mapsto -\frac n2\log(2\pi) - \frac12 n\log\hat\sigma^2(\theta) - \frac12\sum_{t=1}^n \log v_{t-1}(\theta) - \frac12 n.$$

The latter function is called the profile likelihood for $\theta$, and the process of eliminating the
parameter $\sigma^2$ is referred to as concentrating out this parameter. We can drop the constant
terms in the profile likelihood and conclude that the maximum Gaussian likelihood
estimator $\hat\theta$ for $\theta$ minimizes

$$\theta \mapsto \log\frac1n\sum_{t=1}^n \frac{(X_t-\Pi_{t-1}X_t(\theta))^2}{v_{t-1}(\theta)} + \frac1n\sum_{t=1}^n \log v_{t-1}(\theta). \tag{12.7}$$

The maximum Gaussian likelihood estimator for $\sigma^2$ is $\hat\sigma^2(\hat\theta)$.

For causal, invertible stationary ARMA processes the innovations $X_t - \Pi_{t-1}X_t$
are for large $t$ approximately equal to $Z_t$, whence $v_{t-1}(\theta) \approx EZ_t^2/\sigma^2 = 1$. (Cf. the
discussion in Section 7.4. In fact, it can be shown that $|v_{t-1} - 1| \le c^t$ for some
$0 < c < 1$ and sufficiently large $t$.) This suggests that the criterion function (12.7) does
not change much if we drop its second term and retain only the sum of squares. The
corresponding approximate maximum Gaussian likelihood estimator is precisely the least
squares estimator, discussed in Section 10.3. □
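
For the simplest case, a mean-zero MA(1) process $X_t = Z_t + \theta Z_{t-1}$, concentrating out
$\sigma^2$ can be sketched in a few lines of Python. The scaled prediction errors are computed by
the innovations recursion $r_t = 1 + \theta^2 - \theta^2/r_{t-1}$ with $r_0 = 1 + \theta^2$, which is a standard
result for MA(1) processes and is assumed here rather than derived in this section. The
function name `ma1_profile`, the toy data and the grid are hypothetical.

```python
import math


def ma1_profile(x, theta):
    """Return (sigma2_hat, criterion, r) for a mean-zero MA(1) with coefficient theta.

    r[t-1] is the scaled prediction error v_{t-1}(theta) (prediction variance divided
    by sigma^2). The criterion log(sigma2_hat) + (1/n) * sum(log r) is to be
    minimized over theta, after sigma^2 has been concentrated out.
    """
    n = len(x)
    r_prev = 1.0 + theta * theta   # r_0
    pred = 0.0                     # predictor of the first observation
    weighted_ss = 0.0
    sum_log_r = 0.0
    r = []
    for xt in x:
        r.append(r_prev)
        innovation = xt - pred
        weighted_ss += innovation * innovation / r_prev
        sum_log_r += math.log(r_prev)
        pred = theta * innovation / r_prev                 # predictor of the next value
        r_prev = 1.0 + theta * theta - theta * theta / r_prev
    sigma2_hat = weighted_ss / n
    return sigma2_hat, math.log(sigma2_hat) + sum_log_r / n, r


# Toy data (hypothetical) and a crude grid search over the invertible range.
x = [0.5, -1.0, 0.25, 1.5]
sigma2_at_zero, criterion_at_zero, _ = ma1_profile(x, 0.0)
grid = [k / 100.0 for k in range(-95, 96)]
theta_hat = min(grid, key=lambda th: ma1_profile(x, th)[1])
```

For $\theta = 0$ the predictors vanish and $\hat\sigma^2$ reduces to the average of the squares. For
$|\theta| < 1$ the values `r` approach 1 geometrically fast, in line with the bound quoted in
the example.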

12.9 Example (GARCH). The distribution of a GARCH process $X_t = \sigma_tZ_t$ depends
on the distribution of the innovations $Z_t$, but is rarely (or never?) Gaussian. Nevertheless
we may try and apply the method of Gaussian likelihood.

Because a GARCH series is a white noise series, the linear one-step ahead predictors
are identically zero, and the prediction variances are equal to the variances $v_{t-1} = EX_t^2$
of the process. For a stationary GARCH process these are constant and can be expressed
in the parameters of the GARCH process. For instance, for the GARCH(1, 1) process we
have that $EX_t^2 = \alpha/(1-\phi-\theta)$. Because the predictions are zero, the Gaussian likelihood
depends on the parameters of the model only through the prediction variances
$v_{t-1} = EX_t^2$. It follows that the likelihood is constant on sets of constant prediction variance
and hence can at best yield good estimators for functions of this variance. The GARCH
parameters cannot be recovered from this. For instance, we cannot estimate $(\alpha,\phi,\theta)$
from a criterion function that depends on these parameters only through $\alpha/(1-\phi-\theta)$.

We conclude that the method of Gaussian likelihood is useless for GARCH processes.
(We note that the Gaussian likelihood is similar in form to the likelihood obtained
by assuming that the innovations $Z_t$ are Gaussian (cf. Example 12.2), but with the
conditional variances $\sigma_t^2$ in the latter replaced by their expectations.) □

In the preceding examples we have seen that for AR and ARMA processes the
Gaussian maximum likelihood estimators are, asymptotically as $n \to \infty$, close to the
least squares estimators. The following theorem shows that the asymptotic behaviour
of these estimators is identical to that of the least squares estimators, which is given in
Theorem 10.16.
12.10 Theorem. Let $X_t$ be a causal, invertible stationary ARMA(p, q) process relative
to an i.i.d. sequence $Z_t$. Then the Gaussian maximum likelihood estimator satisfies

$$\sqrt n\left(\begin{pmatrix}\hat{\vec\phi}_p\\ \hat{\vec\theta}_q\end{pmatrix} - \begin{pmatrix}\vec\phi_p\\ \vec\theta_q\end{pmatrix}\right) \rightsquigarrow N\bigl(0, \sigma^2J^{-1}_{\vec\phi_p,\vec\theta_q}\bigr),$$

where $J_{\vec\phi_p,\vec\theta_q}$ is the covariance matrix of $(U_{-1},\ldots,U_{-p},V_{-1},\ldots,V_{-q})$ for stationary
auto-regressive processes $U_t$ and $V_t$ satisfying $\phi(B)U_t = \theta(B)V_t = Z_t$.

Proof. The proof is long and technical. See Brockwell and Davis (1991), pages 375-396, 
Theorem 10.8.2. ■ 

The theorem does not assume that the time series $X_t$ itself is Gaussian; it uses
the Gaussianity only as a working hypothesis to define maximum likelihood estimators.
Apparently, using "the wrong likelihood" still leads to reasonable estimators. This is
plausible, because Gaussian maximum likelihood estimators are asymptotically equivalent
to least squares estimators and the method of least squares can be motivated without
reference to Gaussianity. Alternatively, it can be explained from a consideration of the
Kullback-Leibler divergence, as in Section 12.2.

On the other hand, in the case that the series $X_t$ is not Gaussian, the true
maximum likelihood estimators (if the true model, i.e. the true distribution of the noise
factors $Z_t$, is known) are likely to perform better than the least squares estimators.
In this respect time series analysis is not different from the situation for replicated
experiments. An important difference is that in practice non-Gaussianity may be difficult
to detect, other plausible distributions difficult to motivate, and other likelihoods may
yield computational problems.

12.3.1 Whittle Estimators 

Because the Gaussian likelihood function of a mean zero time series depends on the
autocovariance function only, it can be helpful to write it in terms of the spectral density.
The covariance matrix of a vector $(X_1,\ldots,X_n)$ belonging to a stationary time series $X_t$
with spectral density $f_X$ can be written as $\Gamma_n(f_X)$, for

$$\Gamma_n(f) = \Bigl(\int_{-\pi}^{\pi} e^{i(s-t)\lambda}f(\lambda)\,d\lambda\Bigr)_{s,t=1,\ldots,n}.$$

Thus if the time series $X_t$ has spectral density $f_\theta$ under the parameter $\theta$ and mean zero,
then the log Gaussian likelihood can be written in the form

$$-\frac n2\log(2\pi) - \frac12\log\det\Gamma_n(f_\theta) - \frac12\vec X_n^T\Gamma_n(f_\theta)^{-1}\vec X_n.$$

Maximizing this expression over $\theta$ is equivalent to maximizing the Gaussian likelihood
as discussed previously, but gives a different perspective. For instance, to fit an ARMA
process we would maximize this expression over all "rational spectral densities" of the
form $\sigma^2|\theta(e^{-i\lambda})|^2/\bigl(2\pi|\phi(e^{-i\lambda})|^2\bigr)$.
The true advantage of writing the likelihood in spectral notation is that it suggests
a convenient approximation. The Whittle approximation is defined as

$$-n\log(2\pi) - \frac n{4\pi}\int_{-\pi}^{\pi}\log f_\theta(\lambda)\,d\lambda - \frac n{4\pi}\int_{-\pi}^{\pi}\frac{I_n(\lambda)}{f_\theta(\lambda)}\,d\lambda,$$

where $I_n(\lambda)$ is the periodogram of the time series $X_t$, as defined in Section 11.2. This
approximation results from the following approximations, for a sufficiently regular function
$f$,

$$\Gamma_n(f)^{-1} \approx \Gamma_n\Bigl(\frac1f\Bigr)\frac1{4\pi^2},$$

$$\frac1n\log\det\Gamma_n(f) \approx \log(2\pi) + \frac1{2\pi}\int_{-\pi}^{\pi}\log f(\lambda)\,d\lambda,$$

combined with the identity

$$\frac1n\vec X_n^T\Gamma_n(f)\vec X_n = 2\pi\int I_n(\lambda)f(\lambda)\,d\lambda.$$

The approximations are made precise in Lemma ?, whereas the identity follows by some
algebra.

12.11 EXERCISE. Verify the identity in the preceding display. 

The Whittle approximation is both more convenient for numerical manipulation and
more readily amenable to theoretical analysis. The point of maximum $\hat\theta_n$ of the Whittle
approximation, if it exists, is known as the Whittle estimator. Conceptually, this again
comes down to a search in the class of spectral densities $f_\theta$ defined through the model.

12.12 Example (Auto-regressive process). For an auto-regressive time series of fixed
order the Whittle estimators are identical to the Yule-Walker estimators, which are also
(almost) identical to the maximum Gaussian likelihood estimators. This can be seen as
follows.

The Whittle estimators are defined by maximizing the Whittle approximation over
all spectral densities of the form $f_\theta(\lambda) = \sigma^2/\bigl(2\pi|\phi(e^{-i\lambda})|^2\bigr)$, for $\phi$ the auto-regressive
polynomial $\phi(z) = 1 - \phi_1z - \cdots - \phi_pz^p$. By the Kolmogorov-Szegő formula (See ?), or direct
computation, $\int\log f_\theta(\lambda)\,d\lambda = 2\pi\log(\sigma^2/2\pi)$ is independent of the parameters $\phi_1,\ldots,\phi_p$.
Thus the stationary equations for maximizing the Whittle approximation with respect
to the parameters take the form

$$\begin{aligned}
0 = \frac{\partial}{\partial\phi_k}\int \phi(e^{i\lambda})\phi(e^{-i\lambda})I_n(\lambda)\,d\lambda
&= \int\bigl[-e^{i\lambda k}\phi(e^{-i\lambda}) - \phi(e^{i\lambda})e^{-i\lambda k}\bigr]I_n(\lambda)\,d\lambda\\
&= -2\,\mathrm{Re}\int\bigl[e^{i\lambda k} - \phi_1e^{i\lambda(k-1)} - \cdots - \phi_pe^{i\lambda(k-p)}\bigr]I_n(\lambda)\,d\lambda\\
&= -2\,\mathrm{Re}\bigl[\gamma_n^*(k) - \phi_1\gamma_n^*(k-1) - \cdots - \phi_p\gamma_n^*(k-p)\bigr],
\end{aligned}$$

because $\gamma_n^*(h) = n^{-1}\sum_{t=1}^{n-h}X_{t+h}X_t$ are the Fourier coefficients of the function $I_n$ for
$0 \le h < n$, by (11.2). Thus the stationary equations are the Yule-Walker equations,
apart from the fact that the observations have been centered at mean zero, rather than
at $\bar X_n$. □
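
The claim of this example can be checked numerically for an AR(1) model. The Python sketch
below computes the periodogram on a grid of $2n$ frequencies, approximates
$\int I_n(\lambda)|\phi(e^{-i\lambda})|^2\,d\lambda$ by a Riemann sum, and minimizes it over $\phi$ on a grid. It
assumes the periodogram convention $I_n(\lambda) = |\sum_t X_te^{-it\lambda}|^2/(2\pi n)$; the function names,
the grid and the toy data are hypothetical. Only the ratio integral is minimized, because the
integral of $\log f_\theta$ does not depend on $\phi$.

```python
import cmath
import math


def periodogram_grid(x):
    """Periodogram I_n at the 2n frequencies 2*pi*j/(2n), j = 0, ..., 2n-1."""
    n = len(x)
    m = 2 * n
    values = []
    for j in range(m):
        lam = 2.0 * math.pi * j / m
        dft = sum(x[t] * cmath.exp(-1j * lam * (t + 1)) for t in range(n))
        values.append(abs(dft) ** 2 / (2.0 * math.pi * n))
    return values


def whittle_ratio_integral(pgram, phi):
    """Riemann sum for the integral of I_n(lam) * |1 - phi e^{-i lam}|^2 over one period."""
    m = len(pgram)
    total = 0.0
    for j, value in enumerate(pgram):
        lam = 2.0 * math.pi * j / m
        total += value * abs(1.0 - phi * cmath.exp(-1j * lam)) ** 2
    return 2.0 * math.pi * total / m


def whittle_ar1(x):
    """Grid-search Whittle estimate of phi in a mean-zero AR(1) model."""
    pgram = periodogram_grid(x)
    grid = [k / 100.0 for k in range(-99, 100)]
    return min(grid, key=lambda phi: whittle_ratio_integral(pgram, phi))


# Toy data (hypothetical); compare with the uncentered Yule-Walker value.
x = [1.0, -0.5, 0.3, 0.8, -1.1, 0.2]
n = len(x)
gamma0 = sum(v * v for v in x) / n
gamma1 = sum(x[t] * x[t + 1] for t in range(n - 1)) / n
phi_whittle = whittle_ar1(x)
phi_yule_walker = gamma1 / gamma0
```

Using $2n$ grid points makes the Riemann sum exact for the trigonometric polynomial involved,
so the grid minimizer agrees with $\gamma_n^*(1)/\gamma_n^*(0)$ up to the grid resolution.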

12.13 EXERCISE. Derive the Whittle estimator for $\sigma^2$ for an autoregressive process.

If we write $I_n(f)$ for $\int I_n(\lambda)f(\lambda)\,d\lambda$, then a Whittle estimator is a point of minimum
of the map

$$\theta \mapsto M_n(\theta) = \int\log f_\theta(\lambda)\,d\lambda + I_n\Bigl(\frac1{f_\theta}\Bigr).$$

In Section 11.4 it is shown that the sequence $\sqrt n\bigl(I_n(f) - \int ff_X\,d\lambda\bigr)$ is asymptotically
normally distributed with mean zero and some variance $\sigma^2(f)$, under some conditions.
This implies that the sequence $M_n(\theta)$ converges for every fixed $\theta$ in probability to

$$M(\theta) = \int\log f_\theta(\lambda)\,d\lambda + \int\frac{f_X}{f_\theta}(\lambda)\,d\lambda.$$

By reasoning as in Section 12.1 we expect that the Whittle estimators $\hat\theta_n$ will be
asymptotically consistent for the parameter $\theta_0$ that minimizes the function $\theta \mapsto M(\theta)$.

If the true spectral density $f_X$ takes the form $f_{\theta_0}$ for some parameter $\theta_0$, then this
parameter is the minimizing value. Indeed, by the inequality $-\log x + (x-1) \ge (\sqrt x - 1)^2$,
valid for every $x > 0$,

$$M(\theta) - M(\theta_0) = \int\Bigl(\log\frac{f_\theta}{f_{\theta_0}}(\lambda) + \frac{f_{\theta_0}}{f_\theta}(\lambda) - 1\Bigr)d\lambda \ge \int\Bigl(\sqrt{\frac{f_{\theta_0}}{f_\theta}}(\lambda) - 1\Bigr)^2d\lambda \ge 0.$$

This shows that the function $\theta \mapsto M(\theta)$ possesses a minimum value at $\theta = \theta_0$, and this
point of minimum is unique as soon as the parameter $\theta$ is identifiable from the spectral
density.
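
The inequality used above follows from the elementary bound $\log y \le y - 1$. A short
derivation, written out here as a sketch for completeness, is:

```latex
% Apply \log y \le y - 1 with y = \sqrt{x}, for x > 0.
\begin{align*}
\log x = 2\log\sqrt{x} &\le 2(\sqrt{x} - 1), \\
-\log x + (x - 1) &\ge -2(\sqrt{x} - 1) + (x - 1)
  = x - 2\sqrt{x} + 1 = (\sqrt{x} - 1)^2 .
\end{align*}
```

Equality holds only for $x = 1$, which is why the minimum is attained exactly when
$f_\theta = f_{\theta_0}$ almost everywhere.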

To derive the form of the limit distribution of the Whittle estimators we replace
$\dot M_n(\theta)$ by its linear approximation, as in Section 12.1, and obtain that

$$\sqrt n(\hat\theta_n - \theta_0) = -\bigl(\ddot M_n(\tilde\theta_n)\bigr)^{-1}\sqrt n\,\dot M_n(\theta_0).$$

Denoting the gradient and second order derivative matrix of the function $\theta \mapsto \log f_\theta(\lambda)$
by $\dot\ell_\theta(\lambda)$ and $\ddot\ell_\theta(\lambda)$, we can write

$$\dot M_n(\theta) = \int\dot\ell_\theta(\lambda)\,d\lambda - I_n\Bigl(\frac{\dot\ell_\theta}{f_\theta}\Bigr),$$

$$\ddot M_n(\theta) = \int\ddot\ell_\theta(\lambda)\,d\lambda + I_n\Bigl(\frac{\dot\ell_\theta\dot\ell_\theta^T - \ddot\ell_\theta}{f_\theta}\Bigr).$$

By the results of Section 11.4 the sequence $\sqrt n\,\dot M_n(\theta_0)$ converges in distribution to a
normal distribution with mean zero and variance $\sigma^2(\dot\ell_{\theta_0}/f_{\theta_0})$, under some conditions.
Furthermore, the sequence $\ddot M_n(\theta_0)$ converges in probability to
$\int\dot\ell_{\theta_0}\dot\ell_{\theta_0}^T(\lambda)\,d\lambda =: J_{\theta_0}$. If both are satisfied, then we obtain that

$$\sqrt n(\hat\theta_n - \theta_0) \rightsquigarrow N\bigl(0, J_{\theta_0}^{-1}\sigma^2(\dot\ell_{\theta_0}/f_{\theta_0})J_{\theta_0}^{-1}\bigr).$$

The asymptotic covariance is of the "sandwich form", but reduces to a simpler expression
in the case that the time series $X_t$ is Gaussian, and the Whittle likelihood is an
approximation for the correctly specified likelihood. In this case,

$$\sigma^2(f) = 4\pi\int ff^T(\lambda)f_X^2(\lambda)\,d\lambda.$$

It follows that in this case, and with $f_X = f_{\theta_0}$, the asymptotic covariance of the sequence
$\sqrt n\,\dot M_n(\theta_0)$ reduces to $4\pi J_{\theta_0}$, and the sandwich covariance reduces to $4\pi J_{\theta_0}^{-1}$.

12.14 Example (ARMA). The log spectral density of a stationary, causal, invertible
ARMA(p, q) process with parameter vector $\theta = (\sigma^2,\phi_1,\ldots,\phi_p,\theta_1,\ldots,\theta_q)$ can be written
in the form

$$\log f_\theta(\lambda) = \log\sigma^2 - \log(2\pi) + \log\theta(e^{i\lambda}) + \log\theta(e^{-i\lambda}) - \log\phi(e^{i\lambda}) - \log\phi(e^{-i\lambda}).$$

Straightforward differentiation shows that the gradient of this function is equal to

$$\dot\ell_\theta(\lambda) = \begin{pmatrix}\sigma^{-2}\\[4pt] \dfrac{e^{i\lambda k}}{\phi(e^{i\lambda})} + \dfrac{e^{-i\lambda k}}{\phi(e^{-i\lambda})}\\[8pt] \dfrac{e^{i\lambda l}}{\theta(e^{i\lambda})} + \dfrac{e^{-i\lambda l}}{\theta(e^{-i\lambda})}\end{pmatrix}.$$

Here the second and third lines of the vector on the right are abbreviations of vectors of
length $p$ and $q$, respectively, obtained by letting $k$ and $l$ range over the values $1,\ldots,p$
and $1,\ldots,q$, respectively. The matrix $J_\theta = \int\dot\ell_\theta\dot\ell_\theta^T(\lambda)\,d\lambda$ takes the form

$$J_\theta = \begin{pmatrix}2\pi\sigma^{-4} & 0 & 0\\ 0 & AR & MAAR\\ 0 & MAAR^T & MA\end{pmatrix},$$

where AR, MA, and MAAR are matrices of dimensions $(p\times p)$, $(q\times q)$ and $(p\times q)$,
respectively, which are described in more detail in the following. The zeros must be
replicated to fulfil the dimension requirements, and result from calculations of the type,
for $k \ge 1$,

$$\int_{-\pi}^{\pi}\frac{e^{i\lambda k}}{\phi(e^{i\lambda})}\,d\lambda = \frac1i\oint_{|z|=1}\frac{z^{k-1}}{\phi(z)}\,dz = 0,$$

by Cauchy's theorem, because the function $z \mapsto z^{k-1}/\phi(z)$ is analytic on a neighbourhood
of the unit disc, by the assumption of causality of the ARMA process.
Using the identity $(f+\bar f)(g+\bar g) = 2\,\mathrm{Re}(fg + f\bar g)$ we can compute the $(k,l)$-element
of the matrix MA as

$$2\,\mathrm{Re}\int\Bigl[\frac{e^{i\lambda k}}{\theta(e^{i\lambda})}\frac{e^{i\lambda l}}{\theta(e^{i\lambda})} + \frac{e^{i\lambda k}}{\theta(e^{i\lambda})}\frac{e^{-i\lambda l}}{\theta(e^{-i\lambda})}\Bigr]d\lambda
= 2\,\mathrm{Re}\Bigl(0 + \int\frac{e^{i\lambda(k-l)}}{|\theta(e^{i\lambda})|^2}\,d\lambda\Bigr) = 2\,\gamma_V(k-l)\,2\pi,$$

where $V_t$ is a stationary auto-regressive process satisfying $\theta(B)V_t = Z_t$ for a white noise
process $Z_t$ of unit variance. The matrix AR can be expressed similarly as the covariance
matrix of $p$ consecutive elements of an auto-regressive process $U_t$ satisfying $\phi(B)U_t = Z_t$.
The $(k,l)$-element of the matrix MAAR can be written in the form

$$2\,\mathrm{Re}\Bigl(0 + \int\frac{e^{i\lambda(k-l)}}{\phi(e^{i\lambda})\theta(e^{-i\lambda})}\,d\lambda\Bigr) = 2\,\mathrm{Re}\,2\pi\int f_{UV}(\lambda)e^{i\lambda(k-l)}\,d\lambda.$$

Here $f_{UV}(\lambda) = 1/\bigl(2\pi\phi(e^{i\lambda})\theta(e^{-i\lambda})\bigr)$ is the cross spectral density of the auto-regressive
processes $U_t$ and $V_t$ defined previously (using the same white noise process $Z_t$). Hence
the integral on the left is equal to $2\pi$ times the complex conjugate of the cross
covariance $\gamma_{UV}(k-l)$.

Taking this all together we see that the matrix resulting from deleting the first row
and first column from the matrix $J_\theta/(4\pi)$ results in the matrix $J_{\vec\phi_p,\vec\theta_q}$ that occurs in
Theorem 12.10. Thus the Whittle estimators and maximum Gaussian likelihood estimators
have asymptotically identical behaviour.

The Whittle estimator for $\sigma^2$ is asymptotically independent of the estimators of the
remaining parameters. □



12.3.2 Gaussian Time Series 

In this section we study the behaviour of the maximum likelihood estimators for general
Gaussian time series in more detail. Thus $\hat\theta_n$ is the point of maximum of (12.5) (or
equivalently (12.6)), and we study the properties of the sequence $\sqrt n(\hat\theta_n - \theta)$ under the
assumption that the true density of $(X_1,\ldots,X_n)$ possesses the form (12.5), for some $\theta$.
For simplicity we assume that the time series is centered at mean zero, so that the model
is completely parametrized by the covariance matrix $\Gamma_n(\theta)$. Equivalently, it is determined
by the spectral density $f_\theta$, which is related to the covariance matrix by

$$\bigl(\Gamma_n(\theta)\bigr)_{s,t} = \int_{-\pi}^{\pi}e^{i(s-t)\lambda}f_\theta(\lambda)\,d\lambda.$$

It is easier to express conditions and results in terms of the spectral density $f_\theta$, which is
fixed, than in terms of the sequence of matrices $\Gamma_n(\theta)$. The asymptotic Fisher information
for $\theta$ is defined as

$$I_\theta = \frac1{4\pi}\int_{-\pi}^{\pi}\frac{\partial\log f_\theta}{\partial\theta}(\lambda)\Bigl(\frac{\partial\log f_\theta}{\partial\theta}(\lambda)\Bigr)^Td\lambda.$$
12.15 Theorem. Suppose that $X_t$ is a Gaussian time series with zero mean and spectral
density $f_\theta$ such that the map $\theta \mapsto f_\theta$ is one-to-one and the map $(\theta,\lambda) \mapsto f_\theta(\lambda)$ is
three times continuously differentiable and strictly positive. Suppose that $\theta$ ranges over
a bounded, open subset of $\mathbb R^d$. Then the maximum likelihood estimator $\hat\theta_n$ based on
$X_1,\ldots,X_n$ satisfies $\sqrt n(\hat\theta_n - \theta) \rightsquigarrow N(0, I_\theta^{-1})$.

Proof. See Azencott and Dacunha-Castelle (1984), Chapitre XIII. ■

The theorem is similar in form to the theorem for maximum likelihood estimators
based on replicated experiments. If $p_{n,\theta}$ is the density of $\vec X_n = (X_1,\ldots,X_n)$ (given in
(12.5)), then it can be shown under the conditions of the theorem that

$$I_{n,\theta} := \frac1nE_\theta\frac{\partial}{\partial\theta}\log p_{n,\theta}(\vec X_n)\Bigl(\frac{\partial}{\partial\theta}\log p_{n,\theta}(\vec X_n)\Bigr)^T \to I_\theta.$$

The left side of this display is the true Fisher information for $\theta$ based on $\vec X_n$, and this
explains the name asymptotic Fisher information for $I_\theta$. With this in mind the analogy
with the situation for replicated experiments is perfect.



12.4 Model Selection 

In the preceding sections and chapters we have studied estimators for the parameters of
ARMA or GARCH processes assuming that the orders $p$ and $q$ are known a-priori. In
practice reasonable values of $p$ and $q$ can be chosen from plots of the auto-correlation
and the partial auto-correlation functions, followed by diagnostic checking after fitting a
particular model. Alternatively (or in addition) we can estimate appropriate values of $p$
and $q$ and the corresponding parameters simultaneously from the data. The maximum
likelihood method must then be augmented by penalization.

The value of the likelihood (12.1) depends on the dimension of the parameter $\theta$. If
models of different dimension are available, then we can make the dependence explicit
by denoting the log likelihood as, with $d$ the dimension of the model,

$$\Lambda_n(\theta,d) = \sum_{t=1}^n\log p_{\theta,d}(X_t|X_{t-1},\ldots,X_1).$$

A first idea to select a reasonable dimension is to maximize the function $\Lambda_n$ jointly over
$(\theta,d)$. This rarely works. The models of interest are typically nested in that a model of
dimension $d$ is a submodel of a model of dimension $d+1$. The maximum over $(\theta,d)$ is then
taken for the largest possible dimension. To counter this preference for large dimension
we can introduce a penalty function. Instead of $\Lambda_n$ we maximize

$$(\theta,d) \mapsto \Lambda_n(\theta,d) - \phi_n(d),$$

where $\phi_n$ is a fixed function that takes large values for large values of its argument.
Maximizing this function jointly over $(\theta,d)$ must strike a balance between maximizing
$\Lambda_n$, which leads to big values of $d$, and minimizing $\phi_n$, which leads to small values of $d$.
The choice of penalty function is crucial for this balance to yield good results.

Several penalty functions are in use, each of them motivated by certain considerations.
There is no general agreement as to which penalty function works best, partly
because there are several reasonable criteria for "best". Three examples for models of
dimension $d$ are

$$\mathrm{AIC}(d) = d,$$

$$\mathrm{AICC}(d) = \frac{nd}{n-d-1},$$

$$\mathrm{BIC}(d) = \tfrac12d\log n.$$

The abbreviations are for Akaike's Information Criterion, Akaike's information corrected
criterion, and Bayesian Information Criterion respectively.
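
The effect of the three penalties can be compared with a small Python sketch. It assumes that
the maximized log likelihoods for the candidate dimensions have already been computed; the
numbers in the dictionary are made up, the function names are hypothetical, and the AICC
penalty is taken in the usual form $nd/(n-d-1)$, which is an assumption.

```python
import math


def penalty(name, d, n):
    """Penalty phi_n(d) for a model of dimension d fitted to n observations."""
    if name == "AIC":
        return float(d)
    if name == "AICC":
        return n * d / (n - d - 1.0)   # assumed form of the corrected penalty
    if name == "BIC":
        return 0.5 * d * math.log(n)
    raise ValueError("unknown criterion: " + name)


def select_dimension(max_loglik, n, name):
    """Return the dimension d maximizing (maximized log likelihood) - penalty.

    `max_loglik` maps each candidate dimension to the maximized log likelihood.
    """
    return max(max_loglik, key=lambda d: max_loglik[d] - penalty(name, d, n))


# Made-up maximized log likelihoods for nested models of dimensions 1, 2, 3.
max_loglik = {1: -100.0, 2: -98.5, 3: -98.0}
chosen = {name: select_dimension(max_loglik, 100, name)
          for name in ("AIC", "AICC", "BIC")}
```

With these numbers the heavier BIC penalty selects a smaller model than AIC and AICC, which
illustrates the trade-off between fit and dimension.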

It seems reasonable to choose a penalty function such that as $n \to \infty$ the value $\hat d_n$
that maximizes the penalized likelihood converges to the true value (in probability or
almost surely). By the following theorem penalties such that $\phi_n(d) \to \infty$ faster than
$\log\log n$ achieve this aim in the case of ARMA processes. Here an ARMA(p, q) process is
understood to be exactly of orders $p$ and $q$, i.e. the leading coefficients of the polynomials
$\phi$ and $\theta$ of degrees $p$ and $q$ are nonzero.

12.16 Theorem. Let $X_t$ be a Gaussian causal, invertible stationary ARMA($p_0$, $q_0$)
process and let $(\hat\theta,\hat p,\hat q)$ maximize the penalized likelihood over $\bigcup_{p+q\le d_0}(\Theta_{p,q},p,q)$, where
for each $(p,q)$ the set $\Theta_{p,q}$ is a compact subset of $\mathbb R^{p+q+1}$ consisting of parameters of a
causal, invertible stationary ARMA(p, q) process and $d_0 \ge p_0+q_0$ is fixed. If $\phi_n(d)/n \to 0$
and $\liminf\phi_n(d)/\log\log n$ is sufficiently large for every $d \le d_0$, then $\hat p \to p_0$ and $\hat q \to q_0$
almost surely.

Proof. See Azencott and Dacunha-Castelle (1984), Chapitre XIV. ■ 

The condition on the penalty is met by the BIC penalty, but not by Akaike's penalty
function. It is observed in practice that the use of Akaike's criterion overestimates the
order of the model. The AICC criterion, which puts a slightly bigger penalty on big models,
is an attempt to correct this.

However, choosing a model of the correct order is perhaps not the most relevant
criterion for "good" estimation. A different criterion is the distance of the estimated
model, specified by a pair of a dimension $\hat d$ and a corresponding parameter $\hat\theta$, to the
true law of the observations. Depending on the distance used, an "incorrect" estimate
$\hat d$ together with a good estimate $\hat\theta$ of that dimension may well yield a model that is
closer than the estimated model of the correct (higher) dimension. This paradox arises
because fitting a model of higher dimension requires the estimation of more parameters,
which may result in poorer estimators of all parameters. (Cf. Section 10.1.1.) For ARMA
processes the AIC criterion performs well in this respect.

The AIC criterion is based on the Kullback-Leibler distance.
12.17 EXERCISE. Repeatedly simulate a MA(1) process with $\theta = .8$ for $n = 50$ or
$n = 100$.

(i) Compare the quality of the moment estimator and the maximum likelihood estimator.
(ii) Are the sampling distributions of the estimators approximately normal?

12.18 EXERCISE. Find best fitting AR and ARMA models for the Wolfer sunspot
numbers (object sunspots in Splus).



